# gpucoder.matrixMatrixKernel

Optimized GPU implementation of functions containing matrix-matrix operations

## Syntax

``C = gpucoder.matrixMatrixKernel(FUN,A,B)``
``C = gpucoder.matrixMatrixKernel(FUN,A,B,orientation)``

## Description


`C = gpucoder.matrixMatrixKernel(FUN,A,B)` generates kernels from functions that contain GEMM-like operations. For example, you can match feature points between two images by using:

- The sum of absolute differences (SAD): `F() = @(a,b)abs(a-b)`
- The sum of squared differences (SSD): `F() = @(a,b)(a-b).*(a-b)`

`FUN` is a handle to a user-defined function. It takes one row and column from matrix `A` and one row and column from matrix `B` and outputs a vector with the same type as the input. The output vector is then summed to compute a single scalar value in `C`.

Numeric inputs `A` and `B` must be either of the same size or have sizes that are compatible. For example, if `A` is an `M`-by-`K` matrix and `B` is a `K`-by-`N` matrix, then `C` is an `M`-by-`N` matrix.

`C = gpucoder.matrixMatrixKernel(FUN,A,B,orientation)` has the optional argument `orientation`, which specifies the orientation of the `A` and `B` matrices. It can take one of four possible values:

- `'nn'`: Matrices `A` and `B` are normal.
- `'nt'`: Matrix `B` is transposed.
- `'tn'`: Matrix `A` is transposed.
- `'tt'`: Both matrices `A` and `B` are transposed.
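To make the description concrete, the following is a minimal CPU reference sketch of the semantics, not the GPU Coder implementation. The function name `matrixMatrixKernelRef` is hypothetical. The sketch assumes that `orientation` works like the transpose flags of a BLAS GEMM call (first letter for `A`, second for `B`), that it defaults to `'nn'` when omitted, and that `FUN` is applied to row `i` of `A` paired element by element with column `j` of `B`.

```matlab
function C = matrixMatrixKernelRef(fun, A, B, orientation)
% matrixMatrixKernelRef  Hypothetical CPU reference for illustration only.
%   C(i,j) = sum(fun(row i of op(A), column j of op(B)))
if nargin < 4
    orientation = 'nn';   % assumption: untransposed inputs by default
end
% Assumption: BLAS-style flags, first letter for A, second letter for B.
if orientation(1) == 't'
    A = A.';
end
if orientation(2) == 't'
    B = B.';
end
[M, K] = size(A);
[K2, N] = size(B);
assert(K == K2, 'Inner dimensions of A and B must agree.');
C = zeros(M, N, class(A));
for i = 1:M
    for j = 1:N
        % Pair row i of A with column j of B, apply fun, then sum to a scalar.
        C(i,j) = sum(fun(A(i,:).', B(:,j)));
    end
end
end
```

Under these assumptions, passing `@times` reproduces an ordinary matrix product, and passing `@(a,b)abs(a-b)` yields a SAD score for every row-column pair.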

## Examples

This example performs a simple matrix-matrix multiplication and uses the `matrixMatrixKernel` design pattern to generate CUDA® code.

In one file, write an entry-point function `matMul_nn` that accepts two matrix inputs `f1` and `f2`. Use the MATLAB® function `times` to multiply `f1` and `f2` element by element. The `@` operator creates a handle to the function `times`. Insert the `gpucoder.matrixMatrixKernel()` statement. The input matrices are not transposed, so use the `'nn'` option.

```
function scores = matMul_nn(f1, f2)
    scores = gpucoder.matrixMatrixKernel(@times, f1, f2, 'nn');
end
```

Use the `codegen` function to generate a CUDA MEX function.

```
codegen -config coder.gpuConfig('mex') ...
    -args {ones(1024,1024,'double'),ones(1024,1024,'double')} ...
    -report matMul_nn
```
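One way to sanity-check the result is to compare the MEX output against MATLAB's built-in matrix product. The sketch below assumes code generation succeeded and produced `matMul_nn_mex` (the default MEX name that `codegen` derives from the entry-point function) on the path. The tolerance `tol` is an illustrative choice, not a documented bound.

```matlab
% Inputs must match the -args specification: 1024-by-1024 double.
f1 = rand(1024, 1024);
f2 = rand(1024, 1024);
tol = 1e-10;   % assumed relative tolerance for reordered floating-point sums

% Run the comparison only if the generated MEX file is available.
if exist('matMul_nn_mex', 'file') == 3
    scores = matMul_nn_mex(f1, f2);
    ref = f1 * f2;                                   % CPU reference
    relErr = max(abs(scores(:) - ref(:))) / max(abs(ref(:)));
    fprintf('Maximum relative error: %g\n', relErr);
    assert(relErr < tol, 'MEX output differs from f1*f2.');
else
    disp('matMul_nn_mex not found; run codegen first.');
end
```

A small nonzero error is expected, because the GPU kernel may accumulate the per-element products in a different order than the CPU routine.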

The generated CUDA code contains two kernels: `matMul_nn_kernel1` for initializing the output matrix `scores` and `matMul_nn_kernel2` that performs the `times` operation. The following is a snippet of the generated code.

```
matMul_nn_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_scores);
cudaMemcpy(gpu_f2, f2, 8388608U, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f1, f1, 8388608U, cudaMemcpyHostToDevice);
matMul_nn_kernel2<<<dim3(16U, 16U, 1U), dim3(16U, 16U, 1U)>>>(gpu_f2, gpu_f1, gpu_scores);
cudaMemcpy(scores, gpu_scores, 8388608U, cudaMemcpyDeviceToHost);
```

`matMul_nn_kernel2` has a 2-D grid of 2-D blocks. The kernel launches a 16-by-16 grid of blocks, and each block has 16-by-16 threads, that is, 256 threads per block.
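The numbers in the generated snippet can be cross-checked with a little arithmetic. The sketch below uses only the sizes from the `codegen` command and the launch configurations shown above. The variable names are illustrative, and the outputs-per-thread figure assumes the work is divided evenly among threads, which the generated code does not state explicitly.

```matlab
% Sizes taken from the -args specification: 1024-by-1024 double matrices.
rows = 1024;
cols = 1024;
bytesPerDouble = 8;

numElements = rows * cols;                    % elements in scores
bufferBytes = numElements * bytesPerDouble;   % size passed to cudaMemcpy

% matMul_nn_kernel1<<<dim3(2048,1,1), dim3(512,1,1)>>>
kernel1Threads = 2048 * 512;                  % one thread per output element

% matMul_nn_kernel2<<<dim3(16,16,1), dim3(16,16,1)>>>
gridSize = [16 16];
blockSize = [16 16];
threadsPerBlock = prod(blockSize);
kernel2Threads = prod(gridSize) * threadsPerBlock;

% Assumption: outputs are split evenly across the kernel2 threads.
outputsPerThread = numElements / kernel2Threads;

fprintf('cudaMemcpy size: %d bytes\n', bufferBytes);
fprintf('kernel1 threads: %d\n', kernel1Threads);
fprintf('kernel2 threads: %d (%d per block)\n', kernel2Threads, threadsPerBlock);
fprintf('outputs per kernel2 thread: %d\n', outputsPerThread);
```

The buffer size matches the `8388608U` argument in each `cudaMemcpy` call, and the `matMul_nn_kernel1` thread count equals the number of elements in `scores`, which is consistent with its role of initializing the output.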