Failed to generate large CUDA kernel in GPU coder with FFT function inside

I am trying to run my code in parallel on the GPU.
I converted the code with the "main.m" script as attached, but the MEX code on the GPU is much slower than the M-code on the CPU. I understand that the GPU is not suitable for such a small data size, but it takes much longer on the GPU even when a bigger data size is used.
I then checked the profiling timeline and found that many CUDA kernels are created and the overall GPU utilization is low. After some debugging, I found that when the fft command is used, GPU Coder fails to generate a large CUDA kernel.
I think the performance could be improved significantly if the fft could be incorporated inside one CUDA kernel, as in the situation without fft. FFT is needed. I have searched on Google, but found nothing relevant. Can you provide any information about this or any solution? The output of gpuDevice is also provided in the attachment.
Here is the profiling timeline without fft.
Here is the profiling timeline with fft.

Answers (1)

Justin Hontz on 18 Sep 2024
Hi He,
In your M-code for RandCopy, the for loop cannot be executed as a GPU kernel (even with the coder.gpu.kernel pragma) because of the fft / ifft calls inside the loop. fft is implemented using its own specialized GPU kernel, and GPU Coder does not support nested kernel execution. Consequently, the for loop runs sequentially, which explains why you see thousands of small kernel instances in the performance analyzer timeline graph.
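For context, the problematic pattern presumably looks something like this (a reconstruction based on the batched snippet below, not the actual RandCopy.m from the attachment):

```matlab
coder.gpu.kernel;
for k = 1:size(Data, 1)
    Tmp = fft(Data(k, :));                  % per-slice FFT launches its own cuFFT kernel
    Tmp = (Tmp + (1 + 1i)) * (1564 + 798i); % so the surrounding loop cannot
    Data(k, :) = ifft(Tmp);                 % be fused into one large kernel
end
```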
To improve the performance of your code, you will want to perform your computation using only a single fft / ifft call that operates on the entire input array instead of individual slices. Something like this should work:
Tmp = fft(Data,[],2);
Tmp = Tmp + (1 + 1i);
Tmp = Tmp * (1564 + 798i);
Data = ifft(Tmp,[],2);
After making the change on my end, the performance analyzer report shows a significant performance improvement, with the timeline graph looking similar to the original one without fft.
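Put together, a minimal sketch of the batched version (the RandCopy name, signature, and constants are assumptions taken from the snippet above):

```matlab
function Data = RandCopyBatched(Data) %#codegen
% Hypothetical batched rewrite: one fft/ifft over the whole array
% replaces the per-slice loop, so GPU Coder maps each call to a
% single batched cuFFT invocation instead of thousands of tiny kernels.
Tmp  = fft(Data, [], 2);       % FFT along each row, one batched call
Tmp  = Tmp + (1 + 1i);         % element-wise work fuses into one kernel
Tmp  = Tmp * (1564 + 798i);
Data = ifft(Tmp, [], 2);       % batched inverse FFT
end
```

You would generate code for it in the usual way, e.g. with codegen and a coder.gpuConfig('mex') configuration object.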
  4 Comments
He Da on 19 Sep 2024
I fully understand the benefit of operating on the entire array; that is how I have worked for years. However, it is inherently not suitable here. I have tried disabling cuFFT in the coder config, which results in thousands of memory copies between the host and device. Maybe that path requires other optimizations.
The NVIDIA documentation says: "NVIDIA cuFFT introduces cuFFTDx APIs, device side API extensions for performing FFT calculations inside your CUDA kernel. Fusing numerical operations can decrease the latency and improve the performance of your application."
It seems that cuFFT can be called from device code. Hopefully you can show me how to use cuFFTDx in RandCopy.m. Perhaps that is asking too much.
Justin Hontz on 19 Sep 2024
GPU Coder currently does not support generating direct calls to the cuFFTDx API. That said, you may still be able to call into the API indirectly from the generated code if you are willing to write your own CUDA wrapper function that uses it directly. This can possibly be achieved by invoking the wrapper function inside the for loop of your M-code via coder.ceval. The call would look something like this:
coder.ceval('-gpudevicefcn', 'myFFTWrapper', coder.ref(data), ...);
The -gpudevicefcn flag indicates that the wrapper function is meant to be executed by a GPU thread rather than by the CPU.
Note that I have not tried using this approach on my end, so I cannot guarantee that such an approach would work correctly without issue.
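To make the integration concrete, here is a hedged sketch of how that call might sit inside the loop. Everything specific here is an assumption: myFFTWrapper is a hypothetical __device__ function you would write yourself in CUDA C++ against cuFFTDx and add to the build (e.g. via coder.updateBuildInfo), and the loop bounds and argument list depend on your actual RandCopy:

```matlab
function Data = RandCopyDx(Data) %#codegen
% Hypothetical sketch: call a user-written CUDA device function
% (built on cuFFTDx) from inside a GPU Coder kernel loop.
coder.gpu.kernelfun;
[rows, N] = size(Data);
for k = 1:rows
    % '-gpudevicefcn' tells GPU Coder the hypothetical 'myFFTWrapper'
    % runs on a GPU thread; pass a reference to the start of the slice.
    coder.ceval('-gpudevicefcn', 'myFFTWrapper', ...
        coder.ref(Data(k, 1)), int32(N));
end
end
```

As noted above, this pattern is untested, so treat it as a starting point rather than a working solution.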


Release

R2024b
