Calling competition when calling CUDA kernels using PCT

Question

Geng on 27 Jul 2012

https://au.mathworks.com/matlabcentral/answers/44662-calling-competition-when-calling-cuda-kernels-using-pct

I find PCT's ability to call CUDA kernels very useful, but there seems to be a race between successive kernel calls. Suppose I have a .cu file with a kernel "add1". I first compile it to a .ptx file and load that in MATLAB, which creates a MATLAB kernel object "add1". The call "o=feval(add1,arg1,arg2...)" is then legal. If I want to update existing variables, "[arg1,arg2]=feval(add1,arg1,arg2...)" is legal. If I have another kernel "add2", then "[arg1,arg2]=feval(add2,arg1,arg2...)" is legal as well.

In serial code, we might have step1.m calling add1 and step2.m calling add2, driven by a Forward.m like "for clock =1:tend step1; step2; end". This code may not behave as expected. Sometimes we get the right answer, but most of the time, especially when "tend" is large, the result is wrong.

My guess at the reason: when the CPU calls step1, the GPU starts running the kernel, and the CPU then goes straight on to call step2 without knowing whether kernel "add1" has finished. This would not happen in CUDA C, because there the CPU waits after calling a kernel until the GPU signals that nothing is still running. So how can I solve this problem in MATLAB?
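For readers following along, the workflow described in the question might look roughly like the sketch below. It assumes add1.cu contains a single kernel with the hypothetical prototype shown in the comment and that add1.ptx has already been generated; the problem size, block size, and argument list are illustrative assumptions, not the poster's actual code.

```matlab
% Assumes add1.cu holds one kernel with the hypothetical prototype
%   __global__ void add1(double* a, double* b, int n)
% and that add1.ptx was produced beforehand, e.g. with:  nvcc -ptx add1.cu
add1 = parallel.gpu.CUDAKernel('add1.ptx', 'add1.cu');

n = 1024;                                  % hypothetical problem size
add1.ThreadBlockSize = [256, 1, 1];        % threads per block
add1.GridSize = [ceil(n / 256), 1, 1];     % enough blocks to cover n elements

arg1 = gpuArray(zeros(n, 1));              % data that lives on the GPU
arg2 = gpuArray(ones(n, 1));

% Non-const pointer arguments are returned as outputs, so assigning them
% back to the same variables updates arg1 and arg2 "in place".
[arg1, arg2] = feval(add1, arg1, arg2, int32(n));
```

The second kernel, add2, would be built and called the same way from its own .ptx and .cu files.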

Answer 1

Jill Reese on 1 Aug 2012

https://au.mathworks.com/matlabcentral/answers/44662-calling-competition-when-calling-cuda-kernels-using-pct#answer_55208

In MATLAB we always run CUDA kernels in the order in which they were requested. The add1 kernel should always execute before the add2 kernel, and any arrays computed by the add1 kernel are guaranteed to be ready and available for add2 to use. It may be that there is a latent bug in one of your kernels that is only apparent when run in a tight loop. Some examples include a thread accessing uninitialized data or multiple threads writing to the same location.

Here is one way to truly guarantee that step1 is complete before step2 begins:

g=gpuDevice();
step1;
wait(g) % make sure all GPU execution is complete before executing step2
step2;
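To check whether ordering is really the culprit, one option is to run the same loop with and without wait(g) and compare the results. The sketch below does this with two hypothetical stand-in steps (plain gpuArray arithmetic, not the poster's kernels); in real code they would be replaced by the feval calls made in step1 and step2. The value of tend and the array size are likewise assumptions.

```matlab
% Hypothetical stand-ins for step1 and step2; swap in the real feval calls.
step1 = @(x) x + 1;
step2 = @(x) 0.5 .* x;

g = gpuDevice();                 % handle to the currently selected GPU
tend = 1000;                     % hypothetical number of time steps

% Run A: no explicit synchronization between the steps.
a = gpuArray(zeros(1024, 1));
for clock = 1:tend
    a = step1(a);
    a = step2(a);
end

% Run B: wait for the GPU to finish after every step.
b = gpuArray(zeros(1024, 1));
for clock = 1:tend
    b = step1(b);
    wait(g);
    b = step2(b);
    wait(g);
end

% gather copies the results back to host memory for comparison.
sameResult = isequal(gather(a), gather(b));
fprintf('Results identical with and without wait: %d\n', sameResult);
```

If the two runs agree but the answer is still wrong, the problem is more likely inside one of the kernels than in the order in which they run.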

2 Comments

Jill Reese on 1 Aug 2012

Another question I have is: what operations are happening in step1 and step2? Are you really performing a simple operation like addition, or is there a chance that round-off error accumulated over a very long run is ruining the accuracy of your result? We might be able to help more if example code were provided.

Geng on 2 Aug 2012

Thank you for your answer. Maybe this is the solution; I will try it. Actually I am implementing LBM, and I don't think there will be any unbearable accumulated round-off error.
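As a rough illustration of the round-off concern raised in the comments, the sketch below accumulates the same increment many times in single and in double precision and compares each total with the exact value. The step count and increment are arbitrary assumptions, and nothing in the thread says which precision the poster's kernels use; it only shows how quickly the two can diverge over a long run.

```matlab
% Accumulate the same increment many times in single and double precision.
nSteps = 1e6;                 % hypothetical number of time steps
inc = 0.1;                    % hypothetical per-step increment
sSingle = single(0);
sDouble = 0;
for k = 1:nSteps
    sSingle = sSingle + single(inc);
    sDouble = sDouble + inc;
end

exact = nSteps * inc;         % value the sum would have without round-off
errSingle = abs(double(sSingle) - exact);
errDouble = abs(sDouble - exact);
fprintf('single error: %g, double error: %g\n', errSingle, errDouble);
```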
