Kernels from Element-Wise Loops

The simplest case of CUDA® kernel creation is from MATLAB® functions that contain scalarized, element-wise math operations. When element-wise operations are enclosed within a for-loop body, concurrent CUDA threads can be invoked to compute each loop iteration in parallel. Because CUDA threads execute in no particular order, and are independent of each other, it is essential that no iteration in your for-loop depends on the results of other iterations.

Element-Wise Math Example

This example shows how to create CUDA kernels from functions that contain element-wise math operations. Suppose that you want to square each element of a matrix x and scale by a factor of 1/(i+j), where i,j are the row and column indexes. You can implement this example as a MATLAB function.

function [y] = myFun(x)

y = zeros(size(x));
for i = 1:size(x,1)
    for j = 1:size(x,2)

Preparing myFun for Code Generation

The first statement zeros(size(A)) in the myFun function is to initialize result vector y to zeros. For CUDA code generation, pre-allocate memory for y without incurring the overhead of initializing the memory to zeros. Replace this line with coder.nullcopy(zeros(size(y))).

To create CUDA kernels from loops, GPU Coder™ provides another pragma coder.gpu.kernel. Specifying this kernel pragma overrides all parallel-loop analysis. If you do not specify any parameters, GPU Coder determines the kernel bounds based on the loop bounds and input size. It provides a way for you to specify kernel launch parameters such as thread and block sizes. However, use it only when you know that the loop is safe to parallelize. Because the myFun example is simple and does not require specification of the kernel launch parameters, you can utilize the coder.gpu.kernelfun pragma to generate CUDA kernels.

With these modifications, the original myFun function is suitable for code generation.

function [y] = myFun(x) %#codegen

y = coder.nullcopy(zeros(size(x)));
for i = 1:size(x,1)
    for j = 1:size(x,2)

Generated CUDA Code

When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs squaring and scaling operation. The following is a snippet of the myFun_kernel1 kernel code.

static __global__ __launch_bounds__(512, 1) void myFun_kernel1(const real_T *x,
  real_T *y)
threadId = ((((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y) +
                blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
               threadIdx.z * blockDim.x * blockDim.y) + threadIdx.y * blockDim.x)
    + threadIdx.x;
  i = (int32_T)(threadId / 512U);
  j = (int32_T)(threadId - (uint32_T)i * 512U);
  if ((!(j >= 512)) && (!(i >= 512))) {
    y[i + (j << 9)] = x[i + (j << 9)] * x[i + (j << 9)] / ((real_T)(i + j) + 2.0);

The following is a snippet of the main myFun function. Before calling myFun_kernel1, there is a single cudaMemcpy call that transfers the matrix x from the host (x) to the device (gpu_x). The kernel has 512 blocks containing 512 threads per block, consistent with the size of the input vector. A second cudaMemcpy call copies the result of the computation back to the host.

cudaMemcpy((void *)gpu_x, (void *)x, 2097152ULL, cudaMemcpyHostToDevice);
myFun_kernel1<<<dim3(512U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_x, gpu_y);
cudaMemcpy((void *)y, (void *)gpu_y, 2097152ULL, cudaMemcpyDeviceToHost);


  • If the loop bounds are of the unsigned data type, the code generator may add conditional checks to determine if the loop bounds are valid. These conditional checks may limit optimizations that are performed by the software and introduce reduction kernels that can affect performance.

See Also

| | | |

Related Topics