Create CUDA code for stencil functions



B = gpucoder.stencilKernel(FUN,A,[M N],shape,param1,param2...) applies the function FUN to each [M,N] sliding window of the input A. Function FUN is called for each [M,N] submatrix of A and computes an element of output B. The index of this element corresponds to the center of the [M,N] window.

FUN is a handle to a user-defined function that returns a scalar output of the same type as the input.

C = FUN(X,param1,param2,...)

X is the [M,N] submatrix of the original input A. X can be zero-padded when necessary, for instance at the boundaries of input A. X and the window can also be 1-D.

C is a scalar valued output of FUN. It is the output computed for the center element of the [M,N] array X and is assigned to the corresponding element of the output array B.

param1, param2, ... are optional arguments. Pass them if FUN requires parameters in addition to the input window.
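The sliding-window behavior described above can be sketched in plain MATLAB for the 'same' shape. This is only an illustration of the semantics, not the code that GPU Coder generates. It assumes odd window dimensions so that the window is centered on each element, and the names fun and scale are hypothetical stand-ins for FUN and param1. The anonymous function is used for brevity; for code generation, FUN is a handle to a user-defined function as described above.

```matlab
% Plain-MATLAB sketch of the sliding-window semantics for shape 'same'.
% Assumption: odd window dimensions, so the window is centered on each element.
A     = magic(4);                 % sample 4-by-4 input
win   = [3 3];                    % window size [M N]
scale = 2;                        % hypothetical extra parameter (param1)
fun   = @(X, s) s * sum(X(:));    % hypothetical FUN: scaled sum of the window

% Zero-pad A by floor(M/2) rows and floor(N/2) columns on each side.
padR = floor(win(1)/2);
padC = floor(win(2)/2);
Apad = zeros(size(A) + 2*[padR padC], class(A));
Apad(padR+1:end-padR, padC+1:end-padC) = A;

% Call fun once per window; each call yields one element of B.
B = zeros(size(A), class(A));
for r = 1:size(A,1)
    for c = 1:size(A,2)
        X = Apad(r:r+win(1)-1, c:c+win(2)-1);   % [M,N] window centered at (r,c)
        B(r,c) = fun(X, scale);
    end
end
```

Following the syntax at the top of this page, the corresponding call would have the form gpucoder.stencilKernel(FUN,A,[3 3],'same',scale), with the extra parameter passed after the shape argument.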

The window [M,N] must be less than or equal to the size of A, with the same shape as A.

If A is a 1-D row vector, the window must be [1,N].

If A is a 1-D column vector, the window must be [N,1].

shape determines the size of the output array B. It can have one of three possible values:

  • 'same' - Returns output B that is the same size as A.

  • 'full' - (default) Returns the full output, which is larger than A. If A is of size (x,y), size of B = [x + floor(M/2), y + floor(N/2)].

  • 'valid' - Returns only those parts of the output that are computed without the zero-padded edges of A. If A is of size (x,y), size of B = [x - floor(M/2), y - floor(N/2)].
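As a worked example, the following sketch evaluates the size formulas exactly as they are stated in the list above. The 256-by-256 input size and 3-by-3 window are assumed values chosen for illustration, and the variable names are hypothetical.

```matlab
% Evaluate the documented output-size formulas for each shape option.
% Assumption: A is 256-by-256 and the window is [3 3] (illustrative values).
szA  = [256 256];        % size of A, that is (x,y)
win  = [3 3];            % window size [M N]
half = floor(win/2);     % [floor(M/2) floor(N/2)]

szSame  = szA;           % 'same'  : B has the size of A
szFull  = szA + half;    % 'full'  : [x + floor(M/2), y + floor(N/2)]
szValid = szA - half;    % 'valid' : [x - floor(M/2), y - floor(N/2)]
```

Because these sizes depend only on the size of A, the window, and the shape, they can be determined at compile time.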

The input A must be a vector or matrix with a numeric type supported by FUN. The class of B is the same as the class of A.

Code generation is supported only for fixed size outputs. Shape and window must be compile-time constants because they determine the size of the output.



This example shows how to use gpucoder.stencilKernel to generate CUDA® kernels that filter an image by using stencil operations.

This example performs mean filtering of a 2-D image. In one file, write the entry-point function meanImgFilt that accepts an image matrix A. Create a subfunction my_mean that computes the mean of the 3x3 submatrix.

function B = meanImgFilt(A)  %#codegen
  B = gpucoder.stencilKernel(@my_mean,A,[3 3],'same');
end

function out = my_mean(A)
  % Mean of the window, cast back to the class of the input
  out = cast(mean(A(:)), class(A));
end

Set up the test input image for the meanImgFilt function.

inImage = im2double(imread('cameraman.tif'));

Use the codegen function to generate a CUDA MEX function.

codegen -config coder.gpuConfig('mex') -args {inImage} -report meanImgFilt
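After code generation, it can be useful to check the MEX output against the MATLAB simulation. The following is a minimal sketch that assumes the codegen command above succeeded and produced meanImgFilt_mex, the default MEX name for this entry-point function, and that inImage is still in the workspace. The variable names B_mex, B_ref, and maxErr are hypothetical.

```matlab
% Compare the generated CUDA MEX function with the MATLAB simulation.
% Assumption: meanImgFilt_mex exists on the path after running codegen.
B_mex = meanImgFilt_mex(inImage);   % runs the generated CUDA code
B_ref = meanImgFilt(inImage);       % runs the MATLAB simulation

% Report the largest element-wise difference between the two results.
maxErr = max(abs(B_mex(:) - B_ref(:)));
fprintf('Maximum absolute difference: %g\n', maxErr);
```

A difference that is not close to zero suggests that the generated code does not match the simulation for this input.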

GPU Coder creates three kernels: meanImgFilt_kernel1 for initializing memory, meanImgFilt_kernel2 for optimizing the input memory structure, and meanImgFilt_kernel3 for the mean filtering operation. The following is a snippet of the generated code.

  cudaMalloc(&gpu_B, 524288ULL);
  cudaMalloc(&gpu_A, 524288ULL);
  cudaMalloc(&gpu_expanded, 532512ULL);
  meanImgFilt_kernel1<<<dim3(131U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_expanded);
  cudaMemcpy((void *)gpu_A, (void *)&A[0], 524288ULL, cudaMemcpyHostToDevice);
  meanImgFilt_kernel2<<<dim3(128U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_A,
  meanImgFilt_kernel3<<<dim3(8U, 8U, 1U), dim3(32U, 32U, 1U)>>>(gpu_expanded,
  cudaMemcpy((void *)&B[0], (void *)gpu_B, 524288ULL, cudaMemcpyDeviceToHost);

meanImgFilt_kernel3 uses shared memory (__shared__ qualifier) to improve memory bandwidth and data locality.
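The buffer sizes and launch dimensions in the snippet can be checked with a little arithmetic. The sketch below assumes that cameraman.tif is a 256-by-256 image and that im2double produces 8-byte doubles. The interpretation that gpu_expanded holds the image padded by one element on each side is an inference from the byte count, not something the generated code states.

```matlab
% Worked arithmetic for the allocation sizes and launch dimensions above.
% Assumptions: 256-by-256 input image, 8 bytes per double element.
imgSize      = [256 256];
bytesPerElem = 8;

bytesA        = prod(imgSize) * bytesPerElem;        % 524288, as for gpu_A and gpu_B
expandedSize  = imgSize + 2*floor([3 3]/2);          % [258 258] (inferred padding)
bytesExpanded = prod(expandedSize) * bytesPerElem;   % 532512, as for gpu_expanded

% Total threads per launch: number of blocks times threads per block.
threadsK1 = 131 * 512;           % 67072, covers the 66564 padded elements
threadsK2 = 128 * 512;           % 65536, one thread per input element
threadsK3 = (8*32) * (8*32);     % 65536, a 256-by-256 grid of threads
```

Under these assumptions, each launch provides at least one thread per element of the array it processes.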


  • For very large input sizes, the gpucoder.stencilKernel function may produce CUDA code that does not numerically match the MATLAB® simulation. In such cases, consider reducing the size of the input to produce accurate results.

Introduced in R2017b