Kernels from Scatter-Gather Type Operations

GPU Coder™ also supports the concept of reductions - an important exception to the rule that loop iterations must be independent. A reduction variable accumulates a value that depends on all the iterations together, but is independent of the iteration order. Reduction variables appear on both side of an assignment statement, such as in summation, dot product, and sort. The following example shows the typical usage of a reduction variable x:

x = ...; % Some initialization of x
for i = 1:n
  x = x + d(i);

The variable x in each iteration gets its value either before entering the loop or from the previous iteration of the loop. This serial order type implementation is not suitable for parallel execution due to the chain of dependencies in the sequential execution. An alternative approach is to employ a binary tree-based approach.

In the tree-based approach, you can execute every horizontal level of the tree in parallel over a certain number of passes. When compared to sequential execution, the binary tree does require more memory because each pass requires an array of temporary values as output. The performance benefit that you receive far outweighs the cost of increased memory usage. GPU Coder creates reduction kernels by using this tree-based approach wherein each thread block reduces a portion of the array. Parallel reduction requires partial result data exchanges between thread blocks. In older CUDA® devices, this data exchange was achieved by using shared memory and thread synchronization. Starting with the Kepler GPU architecture, CUDA provides shuffle (shfl) instruction and fast device memory atomic operations that make reductions even faster. Reduction kernels that the GPU Coder creates use the shfl_down instruction to reduce across a warp (32 threads) of threads. Then, the first thread of each warp uses the atomic operation instructions to update the reduced value.

For more information on the instructions, refer to the NVIDIA® documentation.

Vector Sum Example

This example shows how to create CUDA reduction type kernels by using GPU Coder. Suppose that you want to create a vector v and compute the sum of its elements. You can implement this example as a MATLAB® function.

function s = VecSum(v)
    s = 0;
    for i = 1:length(v)
       s = s + v(i);

Prepare vecSum for Kernel Creation

GPU Coder requires no special pragma to infer reduction kernels. In this example, use the coder.gpu.kernelfun pragma to generate CUDA reduction kernels. Use the modified VecSum function.

function s = VecSum(v) %#codegen
    s = 0;
    for i = 1:length(v)
       s = s + v(i);

Generated CUDA Code

When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the vector sum calculation. The following is a snippet of vecSum_kernel1.

static __global__ __launch_bounds__(512, 1) void vecSum_kernel1(const real_T *v,
  real_T *s)
  uint32_T threadId;
  uint32_T threadStride;
  uint32_T thdBlkId;
  uint32_T idx;
  real_T tmpRed;
  thdBlkId = (threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x)
    + threadIdx.x;
  threadId = ((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) +
              blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) + thdBlkId;
  threadStride = gridDim.x * blockDim.x * (gridDim.y * blockDim.y) * (gridDim.z *
  if (!((int32_T)threadId >= 512)) {
    tmpRed = 0.0;
    for (idx = threadId; threadStride < 0U ? idx >= 511U : idx <= 511U; idx +=
         threadStride) {
      tmpRed += v[idx];

    tmpRed = workGroupReduction1(tmpRed, 0.0);
    if (thdBlkId == 0U) {
      atomicOp1(s, tmpRed);

Before calling VecSum_kernel1, two cudaMemcpy calls transfer the vector v and the scalar s from the host to the device. The kernel has one thread block containing 512 threads per block, consistent with the size of the input vector. A third cudaMemcpy call copies the result of the computation back to the host. The following is a snippet of the main function.

  cudaMemcpy((void *)gpu_v, (void *)v, 4096ULL, cudaMemcpyHostToDevice);
  cudaMemcpy((void *)gpu_s, (void *)&s, 8ULL, cudaMemcpyHostToDevice);
  VecSum_kernel1<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_v, gpu_s);
  cudaMemcpy(&s, gpu_s, 8U, cudaMemcpyDeviceToHost);


For better performance, GPU Coder gives priority to parallel kernels over reductions. If your algorithm contains reduction inside a parallel loop, GPU Coder infers the reduction as a regular loop and generates kernels for it.

1-D Reduction Operations on the GPU

You can use the gpucoder.reduce function to generate CUDA code that performs efficient 1-D reduction operations on the GPU. The generated code uses the CUDA shuffle intrinsics to implement the reduction operation.

For example, to find the sum and max elements of an array A:

function s = myReduce(A)
   s = gpucoder.reduce(A, {@mysum, @mymax}); 

function c = mysum(a, b)
   c = a+b;

function c = mymax(a, b)
   c = max(a,b);
For code generation, the gpucoder.reduce function has these requirements:

  • The input must be of numeric or logical data type.

  • The function passed through the @handle must be a binary function that accepts two inputs and returns one output. The inputs and outputs must be of the same data type.

  • The function must be commutative and associative.


For some inputs that are of the integer data type, the code generated for the gpucoder.reduce function may contain intermediate computations that reach saturation. In such cases, the results from the generated code may not match the simulation results from MATLAB.

See Also

| | | | |

Related Topics