Optimized GPU implementation of functions containing matrix-matrix operations



C = gpucoder.matrixMatrixKernel(FUN,A,B) generates kernels from functions that contain GEMM-like operations. For example, you can match feature points between two images by using:

  • The sum of absolute differences (SAD) - FUN = @(a,b)abs(a-b)

  • The sum of squared differences (SSD) - FUN = @(a,b)(a-b).*(a-b)

FUN is a handle to a user-defined function. In the untransposed case, it takes one row of matrix A and one column of matrix B and outputs a vector of the same type as the input. The output vector is then summed to compute a single scalar value in C. Numeric inputs A and B must either be the same size or have compatible sizes. For example, if A is an M-by-K matrix and B is a K-by-N matrix, then C is an M-by-N matrix.
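The reduction described above can be sketched in plain MATLAB. This is a minimal CPU reference loop for checking results, not the code that GPU Coder generates. It assumes FUN receives row i of A and column j of B as vectors of equal length. The name sad and the small test matrices are illustrative only.

```matlab
% Reference sketch (assumption: FUN sees row i of A and column j of B as
% equal-length vectors, and its vector output is summed into C(i,j)).
sad = @(a,b) abs(a-b);            % sum of absolute differences (illustrative name)
A = [1 2 3; 4 5 6];               % M-by-K, here 2-by-3
B = [1 0; 2 1; 3 5];              % K-by-N, here 3-by-2

M = size(A,1);
N = size(B,2);
C = zeros(M,N,'like',A);          % M-by-N output, same type as the input
for i = 1:M
    for j = 1:N
        a = A(i,:).';             % row of A reshaped to a K-by-1 vector
        b = B(:,j);               % column of B, already K-by-1
        C(i,j) = sum(sad(a,b));   % reduce FUN's vector output to one scalar
    end
end
disp(C)                           % displays [0 4; 9 9]
```

Swapping sad for @times in the same loop gives the dot product of each row and column, which is the ordinary matrix product A*B.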

C = gpucoder.matrixMatrixKernel(FUN,A,B,orientation) has the optional argument orientation that specifies the orientation of A and B matrices. It can take one of four possible values:

  • 'nn' - Matrices A and B are normal.

  • 'nt' - Matrix B is transposed.

  • 'tn' - Matrix A is transposed.

  • 'tt' - Both matrices A and B are transposed.
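As a rough illustration of the four options, the sketch below uses ordinary matrix multiplication on the CPU. It assumes the flags follow the BLAS GEMM convention, where a "transposed" operand is passed in as the transpose of the matrix used in the math. That reading is an assumption here, so confirm it against generated results before relying on it. The variable names are illustrative.

```matlab
% Assumption: orientation flags follow the BLAS GEMM convention.
A  = [1 2 3; 4 5 6];       % 2-by-3 operand in untransposed ('n') form
B  = [1 0; 2 1; 3 5];      % 3-by-2 operand in untransposed ('n') form
At = A.';                  % storage a caller would pass for A under 'tn' or 'tt'
Bt = B.';                  % storage a caller would pass for B under 'nt' or 'tt'

Cnn = A    * B;            % 'nn': neither input transposed
Cnt = A    * Bt.';         % 'nt': B supplied transposed
Ctn = At.' * B;            % 'tn': A supplied transposed
Ctt = At.' * Bt.';         % 'tt': both supplied transposed

% Under this convention all four describe the same 2-by-2 product.
disp(isequal(Cnn, Cnt, Ctn, Ctt))
```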



This example performs a simple matrix-matrix multiplication and uses the matrixMatrixKernel design pattern to generate CUDA® code.

In one file, write an entry-point function matMul_nn that accepts two matrix inputs f1 and f2. Use the MATLAB® function times to multiply a row of f1 and a column of f2 element by element. The @ operator creates a handle to the function times. Because the kernel then sums these element-by-element products, each output element is a dot product and the result is the matrix product of f1 and f2. Insert the gpucoder.matrixMatrixKernel() statement. The input matrices are not transposed, so use the 'nn' option.

function scores = matMul_nn(f1, f2)
    scores = gpucoder.matrixMatrixKernel(@times, f1, f2, 'nn');

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') ...
    -args {ones(1024,1024,'double'),ones(1024,1024,'double')} ...
    -report matMul_nn
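One way to sanity-check the result is to compare the MEX output against the built-in matrix product. The sketch below assumes codegen used its default output name, matMul_nn_mex, and that the file is on the path. The random inputs and the 1e-10 relative tolerance are illustrative choices. The GPU call is guarded so the script still runs where the MEX file has not been built.

```matlab
% Sanity-check sketch (assumptions: default MEX name matMul_nn_mex,
% illustrative tolerance of 1e-10).
f1 = rand(1024,1024);                 % sizes must match the -args specification
f2 = rand(1024,1024);
expected = f1 * f2;                   % CPU reference result

if exist('matMul_nn_mex','file') == 3 % exist returns 3 for a MEX file
    scores = matMul_nn_mex(f1, f2);
    % Summation order can differ on the GPU, so compare with a tolerance
    % instead of testing for exact equality.
    relErr = max(abs(scores(:) - expected(:))) / max(abs(expected(:)));
    fprintf('max relative error: %g\n', relErr);
    assert(relErr < 1e-10);
else
    disp('matMul_nn_mex not found; run the codegen command first.');
end
```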

The generated CUDA code contains two kernels: matMul_nn_kernel1 for initializing the output matrix scores and matMul_nn_kernel2 that performs the times operation. The following is a snippet of the generated code.

  matMul_nn_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_scores);
  cudaMemcpy(gpu_f2, f2, 8388608U, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_f1, f1, 8388608U, cudaMemcpyHostToDevice);
  matMul_nn_kernel2<<<dim3(16U, 16U, 1U), dim3(16U, 16U, 1U)>>>(gpu_f2, gpu_f1,
  cudaMemcpy(scores, gpu_scores, 8388608U, cudaMemcpyDeviceToHost);

matMul_nn_kernel2 is launched with a 2-D grid of 2-D blocks. The grid has 16x16 blocks, and each block has 16x16 threads, which is 256 threads per block.
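The numbers in the snippet can be cross-checked with a little arithmetic. The sketch below only restates the launch dimensions and copy sizes shown above. The elements-per-thread figure is an inference from those dimensions, not something the generated snippet states, and the variable names are illustrative.

```matlab
% Launch arithmetic for the 1024-by-1024 double example (illustrative names).
numElems       = 1024 * 1024;            % elements in each matrix
bytesPerMatrix = numElems * 8;           % a double is 8 bytes, matches 8388608U

k1Threads = 2048 * 512;                  % kernel1: 2048 blocks x 512 threads
k2Blocks          = 16 * 16;             % kernel2: 16x16 grid of blocks
k2ThreadsPerBlock = 16 * 16;             % kernel2: 16x16 threads per block
k2Threads         = k2Blocks * k2ThreadsPerBlock;

% Inference: if the work is split evenly, each kernel2 thread covers this
% many output elements.
elemsPerThread = numElems / k2Threads;

fprintf('bytes per matrix:   %d\n', bytesPerMatrix);
fprintf('kernel1 threads:    %d\n', k1Threads);
fprintf('kernel2 threads:    %d\n', k2Threads);
fprintf('elements per thread (kernel2, inferred): %d\n', elemsPerThread);
```

Kernel1 launches one thread per output element (1,048,576 threads), while kernel2 launches 65,536 threads for the same 1,048,576 outputs.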

Introduced in R2017b