GPU Coder™ also supports the concept of reductions - an important exception to the rule
that loop iterations must be independent. A reduction variable accumulates a value that
depends on all the iterations together, but is independent of the iteration order. Reduction
variables appear on both side of an assignment statement, such as in summation, dot product,
and sort. The following example shows the typical usage of a reduction variable
`x`

:

x = ...; % Some initialization of x for i = 1:n x = x + d(i); end

The variable `x`

in each iteration gets its value either before entering
the loop or from the previous iteration of the loop. This serial order type implementation is
not suitable for parallel execution due to the chain of dependencies in the sequential
execution. An alternative approach is to employ a binary tree-based approach.

In the tree-based approach, you can execute every horizontal level of the tree in parallel
over a certain number of passes. When compared to sequential execution, the binary tree does
require more memory because each pass requires an array of temporary values as output. The
performance benefit that you receive far outweighs the cost of increased memory usage.
GPU Coder creates reduction kernels by using this tree-based approach wherein each thread
block reduces a portion of the array. Parallel reduction requires partial result data
exchanges between thread blocks. In older CUDA^{®} devices, this data exchange was achieved by using shared memory and thread
synchronization. Starting with the Kepler GPU architecture, CUDA provides shuffle (`shfl`

) instruction and fast device memory
atomic operations that make reductions even faster. Reduction kernels that the GPU Coder creates use the `shfl_down`

instruction to reduce across a warp
(32 threads) of threads. Then, the first thread of each warp uses the atomic operation
instructions to update the reduced value.

For more information on the instructions, refer to the NVIDIA^{®} documentation.

This example shows how to create CUDA reduction type kernels by using GPU Coder. Suppose that you want to create a vector `v`

and compute the
sum of its elements. You can implement this example as a MATLAB^{®} function.

function s = VecSum(v) s = 0; for i = 1:length(v) s = s + v(i); end end

GPU Coder requires no special pragma to infer reduction kernels. In this example, use
the `coder.gpu.kernelfun`

pragma to generate CUDA reduction kernels. Use the modified `VecSum`

function.

function s = VecSum(v) %#codegen s = 0; coder.gpu.kernelfun(); for i = 1:length(v) s = s + v(i); end end

When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the vector sum calculation. The
following is a snippet of `vecSum_kernel1`

.

static __global__ __launch_bounds__(512, 1) void vecSum_kernel1(const real_T *v, real_T *s) { uint32_T threadId; uint32_T threadStride; uint32_T thdBlkId; uint32_T idx; real_T tmpRed; ; ; thdBlkId = (threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x) + threadIdx.x; threadId = ((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) + thdBlkId; threadStride = gridDim.x * blockDim.x * (gridDim.y * blockDim.y) * (gridDim.z * blockDim.z); if (!((int32_T)threadId >= 512)) { tmpRed = 0.0; for (idx = threadId; threadStride < 0U ? idx >= 511U : idx <= 511U; idx += threadStride) { tmpRed += v[idx]; } tmpRed = workGroupReduction1(tmpRed, 0.0); if (thdBlkId == 0U) { atomicOp1(s, tmpRed); } } }

Before calling `VecSum_kernel1`

, two `cudaMemcpy`

calls transfer the vector `v`

and the scalar `s`

from the
host to the device. The kernel has one thread block containing 512 threads per block,
consistent with the size of the input vector. A third `cudaMemcpy`

call
copies the result of the computation back to the host. The following is a snippet of the
main function.

cudaMemcpy((void *)gpu_v, (void *)v, 4096ULL, cudaMemcpyHostToDevice); cudaMemcpy((void *)gpu_s, (void *)&s, 8ULL, cudaMemcpyHostToDevice); VecSum_kernel1<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_v, gpu_s); cudaMemcpy(&s, gpu_s, 8U, cudaMemcpyDeviceToHost);

For better performance, GPU Coder gives priority to parallel kernels over reductions. If your algorithm contains reduction inside a parallel loop, GPU Coder infers the reduction as a regular loop and generates kernels for it.

You can use the `gpucoder.reduce`

function to generate CUDA code that performs efficient 1-D reduction operations on the GPU. The
generated code uses the CUDA shuffle intrinsics to implement the reduction operation.

For example, to find the `sum`

and `max`

elements of
an array
`A`

:

function s = myReduce(A) s = gpucoder.reduce(A, {@mysum, @mymax}); end function c = mysum(a, b) c = a+b; end function c = mymax(a, b) c = max(a,b); end

`gpucoder.reduce`

function has these requirements:
The input must be of numeric or logical data type.

The function passed through the @handle must be a binary function that accepts two inputs and returns one output. The inputs and outputs must be of the same data type.

The function must be commutative and associative.

For some inputs that are of the integer data type, the code generated for the
`gpucoder.reduce`

function may contain intermediate computations that
reach saturation. In such cases, the results from the generated code may not match the
simulation results from MATLAB.

`coder.gpu.constantMemory`

| `coder.gpu.kernel`

| `coder.gpu.kernelfun`

| `gpucoder.matrixMatrixKernel`

| `gpucoder.reduce`

| `gpucoder.stencilKernel`