Multiple GPU usage in Parallel
There just isn't a ton of information out there about using multiple GPUs.
I apologize in advance for posting only pseudocode rather than exact code. I've also somewhat felt my way through the available MATLAB parallel structures to arrive at the form I want.
What I don't get is the performance I want. Here is what I'm doing:
Number_of_Things = 4;
parpool(4);
spmd(4)  % Parallel region 1: allocate GPUs. Yes, my box has 4 GPUs.
    gd = gpuDevice;
end
% Single-processor stuff gets done here relating to data initialization.
spmd(4)  % Parallel region 2: push each worker's local data to its GPU.
    gpu_data = gpuArray(localdata(labindex));
end
spmd(4)  % Parallel region 3: do the work.
    results = process(gpu_data);
end
spmd(4)  % Parallel region 4: gather the data.
    output(labindex) = gather(results);
end
Now please recognize that the code I've pseudocoded does what I want it to do. I've put things in this form for timing purposes, and I've verified that I'm using 4 different GPUs.
As I vary Number_of_Things, the timing for regions 1, 2 & 4 shows an increase as the number of things increases. I expect that for regions 1 and 4, and I accept it for region 2, since a good bit of data is being transferred.
What I don't understand is the linear increase in the time of region 3 as the number of things increases. If I pull out the references to GPUs and just use standard processors, my time goes large, but stays flat with respect to the number of things. I don't understand why my timing is not flat in the processing region and would appreciate thoughts. My only explanation is that transferring the commands in region 4 to the different GPUs is causing interference and slowing things down in a linear way.
A single thing takes 40 seconds to process; each additional thing adds 10 seconds.
Joss Knight
on 18 Apr 2017
I don't understand why you would think that your processing time wouldn't go up as you increase the number of things. That only works with a GPU if it isn't fully utilized: if you do a matrix multiplication small enough that not all the cores are busy, then maybe you can hope that a bigger multiply wouldn't take any longer. But going from, say, solving a 500x500 linear system to solving a 1000x1000 linear system is definitely going to take longer. There's more data to move around, and you'll need more iterations to complete the solution.
So really it depends entirely on what you're doing inside region 3.
I also can't explain, without knowing more, why the GPU performance is bound to the number of things but the CPU performance isn't. Whatever you're doing is apparently memory bound on the GPU, which means it's affected by the number of things, whereas on the CPU it's probably compute bound.
Do you actually divide your computation up into 4 spmd blocks? Any particular reason?
David Short
on 21 Apr 2017
Edited: Walter Roberson
on 27 Apr 2017
Joss Knight
on 25 Apr 2017
Hi, I apologize if my answer is curt - for a complete response to your specific code example you are best off contacting MathWorks support.
If you are using a parallel pool and you have multiple GPUs then you can indeed run them all in parallel. Let's look at your code:
- Stop opening and closing spmd blocks with every command. You can do more than one thing inside spmd, but every time you close and then reopen the block the workers are forced to synchronize, which is costly.
- There's no need to pass an argument to spmd if you are using all the workers in the pool.
- There is a cost to using a parallel pool, and particularly to using spmd which is intended for communicating work. Synchronization between workers takes time, and more time for more workers.
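Putting the first two points together, the four regions collapse into a single spmd block along these lines (a sketch only, reusing the placeholder names localdata and process from the pseudocode above):

```matlab
spmd
    gd = gpuDevice;                            % each worker is bound to its own GPU
    gpu_data = gpuArray(localdata(labindex));  % push this worker's slice of the data
    results = process(gpu_data);               % do the work
    out = gather(results);                     % pull the result back to the host
end
% out is a Composite on the client; collect the per-worker results.
output = [out{:}];
```

This way the workers synchronize once at the end of the block instead of four times.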
Since your workers don't need to communicate with each other, you really shouldn't be using spmd at all; you should be using parfor or parfeval. I even suggest you disable SPMD support in your pool, because that stops it unnecessarily creating an MPI communicator, which you don't need. Everything should proceed much more smoothly then.
matrix_size = 5000;
pep = randi(255, matrix_size, matrix_size, 12);
for j = 1:8
    num_chan = j;
    poolobj = gcp('nocreate');
    delete(poolobj);
    parpool(num_chan, 'SpmdEnabled', false);
    tic
    parfor i = 1:j
        gd = gpuDevice;                    % each worker selects its own GPU
        pep_gpu = gpuArray(pep(:,:,i));    % each iteration takes a different slice
        R_gpu = work_for_test_two(pep_gpu);
        R = gather(R_gpu);
    end
    toc
    clear R R_gpu pep_gpu;
end
In many ways parfeval is even more appropriate than parfor here; in fact parfevalOnAll is what you want. parfor schedules work in an opaque way and has to do some analysis to decide what data to send to each worker. However, parfeval is a bit more complicated to use.
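As a rough sketch of the parfeval route, reusing pep and work_for_test_two from the example above, and assuming a pool with one worker per GPU so each worker is automatically assigned a different device:

```matlab
pool = gcp;
numChunks = 4;
% Each future ships one slice to a worker, runs it on that worker's GPU,
% and gathers the result back to host memory.
onGPU = @(chunk) gather(work_for_test_two(gpuArray(chunk)));
for k = numChunks:-1:1                    % counting down preallocates f
    f(k) = parfeval(pool, onGPU, 1, pep(:,:,k));
end
results = cell(1, numChunks);
for k = 1:numChunks
    [idx, R] = fetchNext(f);              % collect results as they finish
    results{idx} = R;
end
```

Unlike parfor, this lets you collect each result as soon as its worker finishes, rather than waiting for the whole loop.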
David Short
on 26 Apr 2017
Edited: David Short
on 26 Apr 2017
David Short
on 27 Apr 2017
Edited: Walter Roberson
on 27 Apr 2017
David Short
on 27 Apr 2017
Edited: David Short
on 27 Apr 2017
Walter Roberson
on 27 Apr 2017
David Short:
When you are posting code, please use your cursor to select it, and then click on the "{} Code" button. That would format the code so that the Answers system knows it is code for presentation purposes.
David Short
on 27 Apr 2017