Can I use a GPU instead of a CPU to run a parfor-loop?

Hello,
I am trying to figure out how a GPU works when it comes to parallelization.
Below is the part of the code that I am working on.
I usually rely on CPU parallelization. My CPU has 36 cores and, as far as I understand, each worker is assigned one iteration of the outer loop (in this case, the iterations 1 to Na are distributed across the workers).
Then I was told about GPU parallelization and that MATLAB supports GPU computation. I read some documents but could not understand how it can work so fast.
If I understand correctly, a single GPU usually has many more cores than a CPU. So I was wondering whether GPU parallelization works the same way, for example via spmd, with each GPU core assigned one loop iteration.
parfor i_a = 1:Na                      % outer loop distributed across CPU workers
    for i_d = 1:Nd
        for i_y = 1:Ny
            for i_t = 1:Nt
                % bounded search over the adjustment choice for this (i_a,i_d,i_y,i_t) state
                [adj_sol, adjval] = fminsearchbnd(@(x) ...
                    adjvalue_model1(x,i_a,i_d,i_y,i_t,Utility,A,D,Y,T,R,delta,fixed,Interpol_1,Na,Nd), ...
                    [x0(i_a,i_d,i_y,i_t); x1(i_a,i_d,i_y,i_t)], ...
                    [A(1); D(1)], [A(end); D(end)], options);
            end
        end
    end
end
Any help will be very much appreciated.
Thank you.
  4 Comments
Walter Roberson on 10 Apr 2023
No. Parallel Computing Toolbox is designed to evaluate expressions on the GPU. It is not, for example, designed to be able to run for loops. Even indexing of GPU arrays is not as efficient as you might hope.
Individual NVIDIA GPU cores are not able to make program-flow decisions. They are fast mathematical calculators, but they have no idea how to decode instructions and have no ability themselves to handle logic branches. Instead they are grouped together with a controller, 64 or 128 or more do-what-they-are-told cores per controller. The controller does instruction decoding and figures out which individual cores should execute the following instruction or should pause and do nothing.
For example, if you had an array of 64 elements and one GPU core was responsible for one element of the array, and the code asked to operate on (say) sqrt(A(17)), then the controller would handle that by telling cores 1 to 16 and 18 to 64 to sit out the next instruction, and then would issue a command for all non-suppressed cores to execute sqrt(). With core 17 being the only one not suppressed, it would be the only one to execute the sqrt() -- and the others handled by the same controller would sit idle for that step. A loop like for K = 1:64; B(K) = sqrt(A(K)); end would involve the controller telling each of the cores in turn to do a sqrt() while the others sat idle.
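To make that concrete, here is a minimal sketch (the array names are illustrative) contrasting a whole-array operation, which keeps all the lanes busy in one pass, with an element-by-element loop, which serializes the same work:
A = gpuArray.rand(64, 1);      % 64 values stored on the GPU
B = sqrt(A);                   % one element-wise kernel, all elements in parallel
C = gpuArray.zeros(64, 1);
for K = 1:64
    C(K) = sqrt(A(K));         % each pass touches a single element, so most
end                            % cores behind the controller sit idle each step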
The only way to run an optimization algorithm on the GPU is to use GPU Coder to generate GPU machine code. The coder tool contains optimization routines that try to reduce the number of cores sitting idle.
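As a hedged sketch, generating CUDA-accelerated code for an existing function might look like the following (requires the GPU Coder product; the function name and input size are placeholders, not from the code above):
cfg = coder.gpuConfig('mex');                        % target a CUDA-accelerated MEX function
codegen -config cfg myObjective -args {zeros(2,1)}   % myObjective is a placeholder name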
Now, what you can generally do is write a function that creates GPU arrays, evaluates expressions on those arrays, and eventually gathers the results back to the CPU to do whatever is appropriate with them.
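A minimal sketch of that workflow (the variable names are illustrative): move the data onto the GPU, evaluate array-wise expressions there, then bring the result back with gather():
x = gpuArray.linspace(0, 2*pi, 1e6);   % data lives in GPU memory
y = sin(x).^2 + cos(x).^2;             % evaluated element-wise on the GPU
total = gather(sum(y));                % copy the scalar result back to the CPU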
But if you have (for example) 3072 GPU cores, then there is absolutely no way to have 3072 independent calculations going on. You would be lucky, in such a case, to be able to run 24 independent calculations (and that would require a bunch of work.)

Accepted Answer

Raymond Norris on 19 May 2022
GPU threads aren't divided up among workers. Rather, one worker controls a GPU at a time.
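In practice, that means the useful pattern with multiple GPUs is one pool worker per device. A hedged sketch of that setup (spmdIndex is called labindex in releases before R2022b):
nGPUs = gpuDeviceCount;      % number of GPUs visible to MATLAB
parpool(nGPUs);              % open one worker per GPU
spmd
    gpuDevice(spmdIndex);    % each worker selects its own device
    A = gpuArray.rand(1000); % illustrative workload on that worker's GPU
    s = gather(sum(A, 'all'));
end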
