

Pragma that maps function to GPU kernels



coder.gpu.kernelfun() is a global-level pragma that attempts to map all the computation within the function it resides in onto the GPU. Loops within this function are parallelized into GPU kernels only if they pass the parallel-loop analysis check. This analysis tries to prove that every loop iteration is independent of the others.
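To make the independence requirement concrete, here is a minimal Python sketch. It is not GPU Coder's analysis, which reasons statically about the MATLAB code; it only runs a loop body in forward and reversed order and compares the results. The helper names run_in_order, is_order_independent, elementwise_add, and prefix_dependent are hypothetical.

```python
def run_in_order(body, n, order):
    """Apply body(i, out) for each index in `order` and return the output list."""
    out = [0.0] * n
    for i in order:
        body(i, out)
    return out


def is_order_independent(body, n):
    """Empirical check: does the loop give the same result forward and reversed?

    Passing this check does not prove independence; a compiler has to
    establish it statically. The check only exposes obvious dependences.
    """
    forward = run_in_order(body, n, range(n))
    backward = run_in_order(body, n, range(n - 1, -1, -1))
    return forward == backward


x = [float(i) for i in range(8)]
y = [2.0 * i for i in range(8)]


def elementwise_add(i, out):
    # Iteration i reads and writes only index i, so iterations are independent.
    out[i] = x[i] + y[i]


def prefix_dependent(i, out):
    # Iteration i reads out[i - 1], which an earlier iteration writes.
    out[i] = x[i] + (out[i - 1] if i > 0 else 0.0)
```

The element-wise loop gives the same result in either order, which is the kind of loop that can become a kernel with one thread per iteration. The second loop changes its result when reordered, so it cannot be parallelized that way.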

This pragma does not require any input parameters. It generates kernels whose dimensions are computed automatically based on loop parameters.
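As a rough illustration of how launch dimensions can follow from a loop's trip count, the Python sketch below reproduces the configurations that appear in the generated code on this page: 1 iteration gives 1 block of 32 threads, 1024 iterations give 2 blocks of 512 threads, and 4096 iterations give 8 blocks of 512 threads. The rule used here (block size rounded up to a multiple of 32 and capped at 512) is an assumption inferred from that single example, not GPU Coder's documented heuristic, and launch_dims is a hypothetical helper.

```python
WARP_SIZE = 32           # threads per warp on NVIDIA GPUs
MAX_BLOCK_THREADS = 512  # assumed cap, taken from dim3(512U, 1U, 1U) in the example


def launch_dims(iterations):
    """Return (blocks, threads_per_block) for a 1-D loop with `iterations` trips.

    Assumed rule: round the trip count up to a multiple of the warp size,
    cap the block size, then use enough blocks to cover every iteration.
    """
    if iterations < 1:
        raise ValueError("iterations must be positive")
    rounded = -(-iterations // WARP_SIZE) * WARP_SIZE  # ceiling to a multiple of 32
    threads = min(rounded, MAX_BLOCK_THREADS)
    blocks = -(-iterations // threads)                 # ceiling division
    return blocks, threads
```

Under these assumptions, launch_dims(1024) returns (2, 512) and launch_dims(4096) returns (8, 512). Other loop shapes or configuration settings may lead the code generator to different dimensions.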

This function is a code generation function. It has no effect in MATLAB®.



This example shows how to use the kernelfun pragma in a function and generate CUDA® code.

In one file, write the entry-point function scalars that accepts two vector inputs x and y of size 1x4096 and one scalar input scale. The function has two for-loops of different iteration lengths, one for vector addition and one for accumulating the sum of the scaled elements of x. Place the coder.gpu.kernelfun() pragma within the scalars function.

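function [vout, sout1] = scalars(x,y,scale)
    coder.gpu.kernelfun();
    sout1 = 0;
    vout = coder.nullcopy(zeros(1,1024));

    % Vector addition over the first 1024 elements
    for i=1:1024
        vout(i) = x(i) + y(i);
    end

    % Sum of the scaled elements of x
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;
    end
end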
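When checking the outputs of the generated MEX function, it can help to have a plain reference for what scalars computes. The following Python sketch mirrors the two loops using 0-based indexing. It is independent of GPU Coder, and scalars_reference is a hypothetical name.

```python
def scalars_reference(x, y, scale):
    """Reference model of the two loops in the scalars entry-point function."""
    if len(x) != 4096 or len(y) != 4096:
        raise ValueError("x and y must each have 4096 elements")

    # First loop: element-wise addition over the first 1024 elements.
    vout = [x[i] + y[i] for i in range(1024)]

    # Second loop: sequential accumulation of the scaled elements of x.
    sout1 = 0.0
    for i in range(4096):
        sout1 = (x[i] * scale) + sout1

    return vout, sout1
```

For example, with x and y filled with ones and scale equal to 2, the reference returns a vout of 1024 twos and a sout1 of 8192.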

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex')...
 -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)}...
 -report scalars

GPU Coder creates three kernels: scalars_kernel1 for initializing sout1=0, scalars_kernel2 for vector addition, and scalars_kernel3, the reduction kernel that computes the sum.

  scalars_kernel1<<<dim3(1U, 1U, 1U), dim3(32U, 1U, 1U)>>>(gpu_sout1);
  cudaMemcpy(gpu_y, y, 32768U, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_x, x, 32768U, cudaMemcpyHostToDevice);
  scalars_kernel2<<<dim3(2U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_y, gpu_x, gpu_vout);
  scalars_kernel3<<<dim3(8U, 1U, 1U), dim3(512U, 1U, 1U)>>>(scale, gpu_x, gpu_sout1);
  cudaMemcpy(vout, gpu_vout, 32768U, cudaMemcpyDeviceToHost);
  cudaMemcpy(sout1, gpu_sout1, 8U, cudaMemcpyDeviceToHost);

scalars_kernel2 has two blocks with 512 threads per block for a total of 1024 threads, one for adding each element. Similarly, scalars_kernel3 has eight blocks with 512 threads per block, resulting in a total of 4096 threads. GPU Coder also performs an optimization that minimizes the number of cudaMemcpy function calls. In this example, a copy of the input x is already on the GPU, so no extra cudaMemcpy is required between scalars_kernel2 and scalars_kernel3. In addition to memory optimization, any sequential code between kernels is mapped to CUDA threads to keep data on the GPU.
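The sum loop carries a dependence through sout1, yet it can still run in parallel because the additions can be regrouped into a tree of partial sums. The Python sketch below shows one common pairwise scheme. It is an illustration only, not the code that GPU Coder emits for scalars_kernel3, and tree_reduce_scaled_sum is a hypothetical name.

```python
def tree_reduce_scaled_sum(x, scale):
    """Sum scale * x[i] with a pairwise (tree) reduction instead of a serial loop."""
    # Step 1: each "thread" scales one element independently.
    partial = [v * scale for v in x]
    if not partial:
        return 0.0

    # Step 2: repeatedly add the upper half onto the lower half.
    # Every addition within one pass is independent of the others in that pass.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0.0)  # pad odd lengths with the additive identity
        half = len(partial) // 2
        partial = [partial[i] + partial[i + half] for i in range(half)]

    return partial[0]
```

For 4096 elements the loop above needs 12 passes rather than 4096 sequential additions. Floating-point addition is not exactly associative, so a parallel reduction can differ from the sequential loop in the last bits of the result.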

Version History

Introduced in R2017b