Kernel Analysis

For GPU code generation, the primary mechanism for creating CUDA® kernels is for-loops. The way you write loops in your MATLAB® code significantly affects both the number of kernels created and the performance of the generated code. When you generate GPU code, check the diagnostic report for Loop not parallelized notices on your loop segments. MATLAB functions that your code calls can also contain for-loops that carry these notices. For maximum performance, ensure that the compute-intensive loop segments in your code are mapped to kernels and executed in parallel. The following recommendations help you achieve this goal and generate efficient CUDA kernels.

Mapping Nested Loops to Kernels

Condition

Consider a function that has nested for-loops.

function y = foo(x)
    ...
    for i1 = 1:N1
        for i2 = 1:N2
            for i3 = 1:N3
                for i4 = 1:N4
                    ...
                end
            end
        end
    end

Assume that one of the intermediate loops, i3, is not parallelizable. When GPU Coder™ performs loop analysis to create kernels, it considers only the outermost parallel loops i1,i2 and creates a kernel with the outer loop dimensions N1,N2. The loops i3,i4 are within the kernel body and are executed sequentially. However, if the innermost loop i4 has a large iteration count, better performance may be achieved by creating kernels for the innermost loop.

Action

There are three ways in which you can parallelize the innermost loop:

• Rewrite the code so that the innermost code segment is not within a nested loop.

• If the iteration count of the outer loop is small, wrap the loop range in a call to coder.unroll. This function unrolls the for-loop by making a copy of the loop body for each loop iteration. For more information, see coder.unroll.

function y = foo(x)
    ...
    for i1 = coder.unroll(1:N1)
        ...
    end

• Make the outer loop bound dynamic, for example by passing it in as a function argument. This way, parallel loop analysis fails on the outer loop but succeeds on the inner loops.

function y = foo(x,N1)
    ...
    for i1 = 1:N1
        ...
    end
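
As a concrete illustration of the coder.unroll option, the following minimal sketch unrolls a three-iteration outer loop so that only the inner loop over rows remains as a candidate for a kernel. The function name scaleChannels and the assumption that x has exactly three columns are hypothetical and chosen only for illustration. Whether GPU Coder actually maps the inner loop to a kernel still depends on its loop analysis.

```matlab
function y = scaleChannels(x) %#codegen
% Hypothetical example: x is assumed to be an N-by-3 matrix.
% Each column c is multiplied by its column index c.
y = zeros(size(x), 'like', x);

% The outer loop has a small, fixed iteration count, so it is unrolled.
for c = coder.unroll(1:3)
    % The inner loop has no inter-iteration dependencies.
    for r = 1:size(x,1)
        y(r,c) = x(r,c) * c;
    end
end
end
```

When run directly in MATLAB, coder.unroll has no effect on the result; it only influences code generation.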

For-Loops with Break

Condition

Loops that contain a break statement are not supported for kernel creation.

while (i < N)
    ...
    ...
    if (cond2)
        ...
        ...
        break;
    end
end

Action

Remove breaks by creating a guard variable and conditional.

cond = true;
while (i < N)
    if (cond)
        ...
        ...
        if (cond2)
            cond = false;
        end
    end
end
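
The same guard-variable pattern can be sketched as a complete function. The example below is an assumption made for illustration, not code from GPU Coder: the hypothetical function firstAbove returns the index of the first element of x that exceeds a threshold t, or 0 if there is none, without using break. Note that the guard variable itself is carried from one iteration to the next, so whether the resulting loop is mapped to a kernel still depends on dependence analysis.

```matlab
function idx = firstAbove(x, t) %#codegen
% Hypothetical example: index of the first element of x greater than t.
% Returns 0 when no element qualifies.
idx = 0;
found = false;          % guard variable that takes the place of break

for i = 1:numel(x)
    % Once found is true, the remaining iterations do nothing.
    if ~found && x(i) > t
        idx = i;
        found = true;
    end
end
end
```

The loop always runs for numel(x) iterations, which gives it the fixed trip count that a break statement would otherwise prevent.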

Dependence Analysis Parallel Loop Check Fails

Condition

Kernel extraction uses parallel loop dependence analysis. There are cases where loop dependence analysis cannot detect a parallel for-loop. The coder.gpu.kernel pragma allows GPU Coder to override dependence analysis and force kernel creation. The caveat is that you must be sure that the loop is a “for-all” loop with no inter-iteration dependencies.

Action

Use the coder.gpu.kernel pragma explicitly on each of your for-loops.
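
A minimal sketch of the pragma placement follows. The function name scaleVector and its arguments are hypothetical. The pragma is placed immediately before the for-loop it applies to, and the loop body is written so that each iteration touches only its own element, which is the condition you are asserting when you use the pragma.

```matlab
function out = scaleVector(in, k) %#codegen
% Hypothetical example: multiply every element of in by the scalar k.
out = zeros(size(in), 'like', in);

% Ask GPU Coder to create a kernel for the loop that follows.
% This is safe here only because iteration i reads in(i) and
% writes out(i), with no dependence on any other iteration.
coder.gpu.kernel();
for i = 1:numel(in)
    out(i) = in(i) * k;
end
end
```

Running this function directly in MATLAB assumes that GPU Coder is installed so that coder.gpu.kernel is available; the pragma affects only the generated code, not the MATLAB result.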

Logical Indexing of Arrays

Condition

GPU Coder may not create kernels when logical indexing is used for accessing array elements.

i = (mag ~= 0);
vx(i) = vx(i)./mag(i);
vy(i) = vy(i)./mag(i);

Action

Rewrite the code by using a loop body and guarding with an appropriate conditional.

for i = 1:numel(mag)
    if (mag(i) ~= 0)
        vx(i) = vx(i)./mag(i);
        vy(i) = vy(i)./mag(i);
    end
end

Unsupported Functions

Condition

Using unsupported functions, coder pragmas, toolbox functions, and so on inside a loop prevents the loop from becoming a kernel.

Action

Try rewriting unsupported functions using pure MATLAB.

Loop Interchange

Condition

If the smaller loops in a loop nest are the outermost loops, then a kernel could be created with just a subset of the loops in the nest. If the algorithm allows it, always put the largest loops outermost in the nest.

Action

Rewrite the loop nest so that the larger loops are the outer loops.
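
The interchange can be sketched as follows. The function name addBias and the assumed shapes (x with many rows and few columns, b with one entry per column) are hypothetical. The loop over the large dimension is placed outermost, so that if only the outer loop is mapped to a kernel, the kernel still covers the larger iteration space.

```matlab
function y = addBias(x, b) %#codegen
% Hypothetical example: x is assumed to be nBig-by-nSmall with
% nBig much larger than nSmall, and b has nSmall elements.
[nBig, nSmall] = size(x);
y = zeros(nBig, nSmall, 'like', x);

for i = 1:nBig            % larger loop placed outermost
    for j = 1:nSmall      % smaller loop placed innermost
        y(i,j) = x(i,j) + b(j);
    end
end
end
```

The interchange is valid here because every (i,j) iteration is independent of the others; a loop nest with dependencies between iterations may not allow it.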