Multiple GPU usage in Parallel
There just isn't a ton of information out there about using multiple GPUs.
I apologize in advance for posting only pseudocode rather than exact code. I've also somewhat felt my way through the available MATLAB parallel structures to arrive at the form I want.
What I don't get is the performance I want. Here is what I'm doing:
Number_of_Things = 4;
parpool(4);
spmd(4)  % Parallel region 1: allocate GPUs. Yes, my box has 4 GPUs.
    gd = gpuDevice;
end
% Single-processor stuff gets done here relating to data initialization.
spmd(4)  % Parallel region 2: push each worker's local data to its GPU.
    gpu_data = gpuArray(localdata(labindex));
end
spmd(4)  % Parallel region 3: do the work.
    results = process(gpu_data);
end
spmd(4)  % Parallel region 4: gather the data.
    output(labindex) = gather(results);
end
Now please recognize that the code I've pseudocoded does what I want it to do. I've put things in this form for timing purposes, and I've verified that I'm using 4 different GPUs.
As I vary Number_of_Things, the timing for regions 1, 2 & 4 shows an increase as the number of things increases. I expect that for regions 1 and 4, and I accept it for region 2, since a good bit of data is being transferred.
What I don't understand is the linear increase in the time of region 3 as the number of things increases. If I pull out the references to GPUs and just use standard processors, my time goes large, but stays flat with respect to the number of things. I don't understand why my timing is not flat in the processing region and would appreciate thoughts. My only explanation is that transferring the commands in region 4 to the different GPUs is causing interference and slowing things down in a linear way.
A single thing takes 40 seconds to process; each additional thing adds 10 seconds.
Joss Knight
on 18 Apr 2017
I don't understand why you would think that your processing time wouldn't go up as you increase the number of things. That only works with a GPU if it isn't fully utilized: if you do a matrix multiplication small enough that not all the cores are busy, then maybe you can hope that a bigger multiply wouldn't take any longer. But going from, say, solving a 500x500 linear system to solving a 1000x1000 linear system is definitely going to take longer. There's more data to move around, and you'll need more iterations to complete the solution.
So really it depends entirely on what you're doing inside region 3.
I also can't explain, without knowing more, why the GPU performance is bound to the number of things but the CPU performance isn't. Whatever you're doing is apparently memory bound on the GPU, which means it's affected by the number of things, whereas on the CPU it's probably compute bound.
Do you actually divide your computation up into 4 spmd blocks? Any particular reason?
David Short
on 21 Apr 2017
Edited: Walter Roberson
on 27 Apr 2017
Joss Knight
on 25 Apr 2017
Hi, I apologize if my answer is curt - for a complete response to your specific code example you are best off contacting MathWorks support.
If you are using a parallel pool and you have multiple GPUs then you can indeed run them all in parallel. Let's look at your code:
- Stop opening and closing spmd blocks with every command. You can do more than one thing inside spmd, but every time you close and then reopen the block the workers are forced to synchronize, which is costly.
- There's no need to pass an argument to spmd if you are using all the workers in the pool.
- There is a cost to using a parallel pool, and particularly to using spmd which is intended for communicating work. Synchronization between workers takes time, and more time for more workers.
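Putting the first two points together, the four regions collapse into a single spmd block along these lines (a sketch only, reusing the placeholder names localdata and process from the pseudocode above):

```matlab
spmd
    gd = gpuDevice;                            % each worker is bound to its own GPU
    gpu_data = gpuArray(localdata(labindex));  % push this worker's slice of the data
    results = process(gpu_data);               % do the work
    out = gather(results);                     % pull the result back to the host
end
% out is a Composite on the client; collect the per-worker results.
output = [out{:}];
```

This way the workers synchronize once at the end of the block instead of four times.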
Since your workers don't need to communicate with each other, you really shouldn't be using spmd at all; you should be using parfor or parfeval. I even suggest you disable SPMD support in your pool, because that stops it unnecessarily creating an MPI communicator, which you don't need. Everything should proceed much more smoothly then.
matrix_size = 5000;
pep = randi(255, matrix_size, matrix_size, 12);
for j = 1:8
    num_chan = j;
    poolobj = gcp('nocreate');
    delete(poolobj);
    parpool(num_chan, 'SpmdEnabled', false);
    tic
    parfor i = 1:j
        gd = gpuDevice;                    % each worker selects its own GPU
        pep_gpu = gpuArray(pep(:,:,i));    % each iteration takes a different slice
        R_gpu = work_for_test_two(pep_gpu);
        R = gather(R_gpu);
    end
    toc
    clear R R_gpu pep_gpu;
end
In many ways parfeval is even more appropriate than parfor here; in fact parfevalOnAll is what you want. parfor schedules work in an opaque way and has to do some analysis to decide what data to send to each worker. However, parfeval is a bit more complicated to use.
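As a rough sketch of the parfeval route, reusing pep and work_for_test_two from the example above, and assuming a pool with one worker per GPU so each worker is automatically assigned a different device:

```matlab
pool = gcp;
numChunks = 4;
% Each future ships one slice to a worker, runs it on that worker's GPU,
% and gathers the result back to host memory.
onGPU = @(chunk) gather(work_for_test_two(gpuArray(chunk)));
for k = numChunks:-1:1                    % counting down preallocates f
    f(k) = parfeval(pool, onGPU, 1, pep(:,:,k));
end
results = cell(1, numChunks);
for k = 1:numChunks
    [idx, R] = fetchNext(f);              % collect results as they finish
    results{idx} = R;
end
```

Unlike parfor, this lets you collect each result as soon as its worker finishes, rather than waiting for the whole loop.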
David Short
on 26 Apr 2017
Edited: David Short
on 26 Apr 2017
David Short
on 27 Apr 2017
Edited: Walter Roberson
on 27 Apr 2017
David Short
on 27 Apr 2017
Edited: David Short
on 27 Apr 2017
Walter Roberson
on 27 Apr 2017
David Short:
When you are posting code, please use your cursor to select it, and then click on the "{} Code" button. That would format the code so that the Answers system knows it is code for presentation purposes.
David Short
on 27 Apr 2017