Preallocation of Composites using spmd
I'm seeing dramatically non-linear execution times with the test code below, where I allocate a different GPU to each of up to 4 spmd workers. (Yes, I do have the hardware.) I then do some work on each worker and time it over 10 trials.
Note the clear line within the trial loop but outside the spmd block.
If that clear is included, the trial_time values make sense; if it is not included, they do not. For example, with N_gpus = 2, including the clear produces trial_time values in the narrow range 0.0938 to 0.1111, but without the clear I get 0.3884, 0.0915, 6.4601, 15.2599, 15.2746, 15.2792, 15.2892, 15.2900.
I'm left pondering that if this were not spmd code I would find a way to preallocate the data, but I'm not sure how to do that with Composites in this case.
Ideas and explanations are welcome.
for N_gpus = 1:4
    poolobj = gcp('nocreate');      % if no pool exists, do not create one here
    if isempty(poolobj)
        poolobj = parpool( N_gpus );
    end
    poolsize = poolobj.NumWorkers;
    trial_time = zeros(1,10);       % preallocate the timing vector
    for trial = 1:10
        spmd( N_gpus )
            g = gpuDevice();        % each worker selects its own GPU
        end
        tic
        spmd( N_gpus )
            for m = 1:50
                A = rand(5000,5000,'gpuArray');
                B = rand(5000,5000,'gpuArray');
                C = A * B;
                max_C = max(C);
            end
        end
        clear A B C;                %% THIS IS THE INTERESTING LINE
        trial_time(trial) = toc;
    end
    tt = mean(trial_time(1:10));
    fprintf( 'N=%d time=%6.3f \n', N_gpus, tt );
    delete( gcp('nocreate') );
end
0 Comments
Accepted Answer
Joss Knight on 19 May 2017
It seems like this is just an issue of timing and synchronisation. You can see this by adding a call to wait(g) at the end of your spmd block, which removes the dependency on the use of clear.
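A minimal sketch of that change, based on the timed block from the question (assuming g is the gpuDevice handle obtained at the start of the trial):

```matlab
tic
spmd( N_gpus )
    g = gpuDevice();                      % handle to this worker's GPU
    for m = 1:50
        A = rand(5000,5000,'gpuArray');
        B = rand(5000,5000,'gpuArray');
        C = A * B;
        max_C = max(C);
    end
    wait(g);   % block until every kernel queued on this GPU has finished
end
trial_time(trial) = toc;   % now measures completed work, not just kernel launches
```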
Basically, if you don't call clear, then the first call to rand in each trial doesn't find enough pooled memory, so it has to do a raw allocation. As it turns out, it could have freed the memory currently held by A; but it doesn't know that the assignment isn't going to error, so it has to create the new array first, in case A needs to be left unchanged. (This wouldn't be true if your entire script were inside a function, since A doesn't have to be preserved if there's an error.)
When you do a raw allocation, the device has to be synchronised. But when you do call clear, the memory for A, B and C is returned to the pool, so the next time no raw allocation is needed and no synchronisation happens. The loop happily continues, queuing up 300 or so kernels, then exits the spmd block and records the time on the client long before any of those kernels have actually finished.
So when you don't call clear you're usually measuring the actual time of the previous trial, and when you do call clear you're recording completely the wrong time, since the computations haven't finished yet.
Depending on the GPU's memory, how much is needed, how much is available when the code is called, how much is already pooled from earlier operations (MATLAB by default pools memory up to a quarter of device memory), and whether or not you're inside a function, your timings will vary. Your best bet for realistic timings is gputimeit, or, if you must, tic and toc in conjunction with wait. However, the pool will always create confusion here, because you don't necessarily know when raw allocations (which are costly even ignoring synchronisation) are going to happen.
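For example, a gputimeit-based measurement of the multiply step might look like this (a sketch; gputimeit synchronises the device and averages several runs itself, so no explicit wait is needed):

```matlab
spmd( N_gpus )
    A = rand(5000,5000,'gpuArray');
    B = rand(5000,5000,'gpuArray');
    f = @() A * B;        % operation to time; inputs are created outside the handle
    t = gputimeit(f);     % time in seconds, with proper device synchronisation
    fprintf('worker %d: %.4f s per multiply\n', labindex, t);
end
```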
4 Comments
Joss Knight on 24 May 2017 (edited 24 May 2017)
By the way, the values in max_C are fine. Asynchronous execution NEVER means you get wrong answers. If you ever ask to see, copy, or operate on the results of an operation, it will ensure that operation is finished before doing that (e.g. it won't display max_C without finishing computing max_C).
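In other words, any operation that needs the data forces completion first. A sketch, assuming C is a gpuArray produced by an asynchronously queued operation:

```matlab
C = A * B;               % queued on the GPU; the call returns immediately
m = gather(max(C(:)));   % gather waits for C to be computed, so m is always correct
```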