parfeval inside a class method does not update a class property

I have a class with a very time consuming method heavyTask(). This method operates on each element of an array stored as a property. Since the execution on one element does not depend on others, I want to speed up the execution by using both local CPU and GPU in parallel. An abstraction of this class would be:
classdef myClass < matlab.mixin.Copyable
    properties
        items
        result
    end
    methods
        function self = myClass()
        end
        function execute(self, N)
            self.items = 1:N;
            f(1) = parfeval(@heavyTask, 0, self, false);
            f(2) = parfeval(@heavyTask, 0, self, true);
            fetchOutputs(f);
        end
        function heavyTask(self, gpu)
            while not(isempty(self.items))
                n = self.items(1);
                self.items(1) = [];
                if gpu
                    self.result(n) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
                else
                    self.result(n) = mean(real(eig(rand(1000))));
                end
            end
        end
    end
end
I use parfeval() to run two instances of heavyTask() in parallel: one uses gpuArray and the other does not. The workload is split through self.items, the list of array items not yet processed. heavyTask() checks this list, picks one element, and removes it from the list. Since I cannot predict how many array elements each worker will process, this first-come-first-served idea is the only approach I have come up with.
This is how I create the class and execute the method:
a = myClass;
a.execute(4);
a.result
Unfortunately, this is what I get:
a.result
ans =
[]
However, if I replace parfeval() and fetchOutputs() with heavyTask(self, true) I get the desired behaviour:
a.result
ans =
0.4959 0.4969 0.4891 0.4778
I have not found any question that has answered my issue. The closest match I have got is this, but it does not seem to address my problem.
Is this the expected behaviour? Is there any workaround I can implement in my class?
Many thanks in advance for your help!

 Accepted Answer

Couldn't you do something like this?
function execute(self, N)
    gd = gpuDevice;
    % Test execution times
    tic;
    result(1) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
    wait(gd);
    tgpu = toc;
    tic;
    result(2) = mean(real(eig(rand(1000))));
    tcpu = toc;
    T = floor(tcpu*(N-2)/(tcpu+tgpu));
    % Divide the remaining iterations
    parfor n = 3:N
        if n <= T  % GPU
            result(n) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
        else  % CPU
            result(n) = mean(real(eig(rand(1000))));
        end
    end
    self.result = result;
end

10 Comments

Yes and no. heavyTask() execution time can differ greatly between CPU and GPU, and to calculate T I need to wait for both to finish. This class is used in a bigger code that calls it thousands of times with different data, so waiting to calculate T on every iteration would slow it down. A workaround could be to calculate T only once, but the entire execution takes hours to days and I suspect the CPU load may change over time, making T inefficient.
to calculate T I need to wait for both to finish.
For that to make a significant difference, N and the ratio tgpu/tcpu would have to be pretty small. At such an extreme, it would only make sense to run all of the iterations on the GPU.
Not really. tgpu/tcpu is around 0.1 in my actual code. But it does make sense to split the workload between CPU and GPU because, despite tcpu being much larger than tgpu, heavyTask() can be executed in parallel on different CPU cores. That makes the "effective" ratio between GPU and CPU close to 1.
heavyTask() can be executed in parallel on different CPU cores. That makes the "effective" ratio between GPU and CPU close to 1.
If so, then you already know in advance that you should assign approximately half the parfor loop iterations to the CPU and half to the GPU (T=1/2).
If you still think you need a more optimal calculation of T, then you can revise the calculation of tcpu to use multiple workers,
tic;
parfor i=2:numWorkers+1
result(i)=mean(real(eig(rand(1000))));
end
tcpu=toc;
You can also use parfeval so that tcpu and tgpu are calculated in parallel. If you are right that the "effective" ratio between GPU and CPU is close to 1, then these pre-computations should take about the time of a single execution of heavyTask on the GPU.
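A sketch of that idea, assuming a running pool with at least two workers (one with GPU access); the elapsed times are measured on the workers themselves, so submission overhead is excluded:

```matlab
% Sketch: estimate tgpu and tcpu concurrently with parfeval.
fGpu = parfeval(@timedTask, 2, true);
fCpu = parfeval(@timedTask, 2, false);
[~, tgpu] = fetchOutputs(fGpu);
[~, tcpu] = fetchOutputs(fCpu);

function [r, t] = timedTask(gpu)
    tic;
    if gpu
        r = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
    else
        r = mean(real(eig(rand(1000))));
    end
    t = toc;   % elapsed time measured on the worker itself
end
```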
I think you are right and this should be the best approach. I realised that splitting the workload at run time causes the best-performing worker (usually the GPU) to finish all the available elements and then wait for the other workers to finish. If the GPU-to-CPU ratio is very small, the GPU idles many times, making the parallelisation slower than GPU only.
With T I can calculate the number of elements to send to each worker such that the GPU is always slightly more loaded than the other workers, so it never idles. I will try it.
However, I have an issue calculating T. When I first run heavyTask() on the GPU to estimate T, it is much slower than subsequent executions, leading to a wrong estimate of tgpu and T. Is there any reason for this? I do a reset(gpuDevice(1)) at the very beginning of execute() to free up the GPU memory, if that makes any difference.
And you've included wait(gd)?
No, should I? I have a gather() right before the toc, I thought I do not need any wait().
No, should I?
Since it was in the code I suggested to you, naturally I assumed it was in your code as well... In any case, I don't know whether gather() has the same synchronising effect as wait(), so you should probably try it. You might also try a trivial initial operation on the GPU, after your reset and before the tic...toc, in case the device needs to be warmed up or something.
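A possible shape for that warm-up (a sketch; the throwaway operation is illustrative, any small GPU computation should do):

```matlab
% Sketch: run a trivial GPU operation after the reset so driver/JIT
% initialisation is not counted in the timed section.
gd = gpuDevice;
reset(gd);
warmup = gather(sum(rand(100, 'gpuArray')));  %#ok<NASGU> throwaway op
wait(gd);                                     % ensure the device is idle
tic;
r = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
tgpu = toc;
```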
One other thing that occurs to me is that you should probably not gather() the result of your GPU computations after every iteration, since this may have undesirable overhead. You should probably leave everything on the GPU until all computations are done and then transfer it at the very end, by modifying the loop as below:
parfor n = 3:N
    if n <= T  % GPU
        result{n} = mean(real(eig(rand(1000, 'double', 'gpuArray'))));
    else  % CPU
        result{n} = mean(real(eig(rand(1000))));
    end
end
self.result = [gather([result{1:T}]), result{T+1:end}];
According to this wait() should not be necessary with gather(), so I would leave the code without it.
Performing a dummy GPU calculation beforehand keeps all GPU iterations at a consistent performance now, so that solves the issue! It seems a somewhat dirty way to warm the GPU up, but it works. Thanks for that!
I agree that gathering after each iteration may add some overhead, but unfortunately in my case I need to gather() and clear at each iteration since the results take up most of the GPU memory; otherwise I get:
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
You are quite welcome, but if you have a solution now to your original question, please Accept-click the answer.


More Answers (1)

You cannot broadcast handle objects to a parpool. They simply get cloned and used as independent class instances on the workers. If you rewrite your class as a value class and execute using value class semantics,
a=a.execute(4);
then it should work.
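A minimal sketch of what the value-class variant could look like. Note the class clones cannot share a live work list, so this version splits the indices up front (the 50/50 split and the class name are illustrative, not the asker's actual design):

```matlab
classdef myValueClass   % value class: no handle superclass
    properties
        result
    end
    methods
        function self = execute(self, N)
            half = floor(N/2);   % illustrative fixed split
            f(1) = parfeval(@myValueClass.heavyTask, 1, 1:half, true);     % GPU
            f(2) = parfeval(@myValueClass.heavyTask, 1, half+1:N, false);  % CPU
            out = fetchOutputs(f, 'UniformOutput', false);
            self.result = [out{:}];   % results come back in index order
        end
    end
    methods (Static)
        function r = heavyTask(idx, gpu)
            r = zeros(1, numel(idx));
            for k = 1:numel(idx)
                if gpu
                    r(k) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
                else
                    r(k) = mean(real(eig(rand(1000))));
                end
            end
        end
    end
end
```

Used with value-class semantics: a = myValueClass; a = a.execute(4); a.result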

7 Comments

Thanks for your answer Matt!
I would rather keep away from a value class because the arrays I process and the results are huge dense matrices, so it might be better to avoid clones.
I could possibly work around this by not writing to properties in heavyTask() and instead just outputting the results and fetching them later. But I definitely need a way to keep track, across instances of heavyTask(), of the array elements that have already been processed. Is there any way to share information across workers at run time? Some sort of pseudo-global variable? The only way I see at the minute is writing to a common file on the HDD, but it sounds like a very dirty hack...
There is no supported way to share information across workers at run time, other than to use a parallel queue https://www.mathworks.com/help/parallel-computing/parallel.pool.dataqueue.html or pollable queue to send data to the controller and have the controller send it to the remaining workers.
There are, of course, other methods, such as using matfile() to write to a shared file, or taking advantage of operating-system shared-memory segments, or using Java to transmit between processes (which is what is used for parallel queues, or was; they have done some optimisation for some kinds of sharing, such as using SLI to share between NVIDIA GPUs).
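For reference, the basic worker-to-client direction looks roughly like this (a sketch; afterEach runs the callback on the client as messages arrive, and the function names are illustrative):

```matlab
% Sketch: workers report each finished item to the client via a DataQueue.
q = parallel.pool.DataQueue;
afterEach(q, @(n) fprintf('item %d done\n', n));  % runs on the client

f = parfeval(@workerTask, 0, q, 1:4);
wait(f);

function workerTask(q, items)
    for n = items
        % ... heavy work for item n would go here ...
        send(q, n);   % one-way: worker -> client only
    end
end
```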
I would rather keep away from a value class because the arrays I process and the results are huge dense matrices, so it might be better to avoid clones.
You cannot avoid cloning if you are sending data to a parpool; cloning is done regardless of handle/value status.
In cases where the data to be cloned is the same for all workers, there are some potential efficiencies.
I managed, using the parallel.pool.DataQueue functionality, to receive data from workers, but I failed to send updates to all workers. I used a combination of DataQueue to notify the client that a worker has finished an element (to update the list of remaining items), then labSend to send the next available item to that worker and labReceive to receive it in the worker's workspace. Unfortunately, it seems labSend and labReceive do not work in combination with parfeval.
My final solution has been to send the minimum necessary data to all workers and remove references to the class object in heavyTask(). I use a temporary binary file on the HDD to store the index of the next array element to be processed, and the workers increase that number by one every time they take a new item. I used fopen/fread/fwrite/fclose to access the file, since I only need to write a single byte, and tic/toc showed this to be about 10x faster than matfile for this purpose. This is not the cleanest solution or the one I wished for, but it is the best I have found and it works.
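A sketch of that shared-counter idea (the helper name is hypothetical, not the asker's actual code; note that plain fopen/fread/fwrite provides no locking, so two workers could in principle read the same index. The asker reports it worked in practice, but a queue-based dispatcher is the supported route):

```matlab
% Hypothetical helper: claim the next unprocessed index from a shared file.
% The file must be initialised once beforehand, e.g.:
%   fid = fopen(counterFile, 'w'); fwrite(fid, 1, 'uint32'); fclose(fid);
function n = nextItem(counterFile)
    fid = fopen(counterFile, 'r+');
    n = fread(fid, 1, 'uint32');    % index this worker should process
    frewind(fid);
    fwrite(fid, n + 1, 'uint32');   % reserve the following index
    fclose(fid);
end
```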
I clocked the execution: with the above approach, a.execute(8) takes 5 seconds to complete. heavyTask() without parfeval takes 43 seconds on the GPU and 46 on the CPU, so I consider this a huge speed-up despite the ugliness of the code.
I can upload the modified class if someone considers it appropriate.
labSend() and labReceive() are for spmd only.
DataQueue and PollableDataQueue are one-way objects. The way to send data back and forth is to:
  1. Have the client create a data queue before starting the workers.
  2. The workers inherit that data queue; when they write to it, the client can read what was written.
  3. In particular, each worker starts by creating a data queue of its own, and writes it to the queue it inherited.
  4. The client reads the worker-created queues sent through its own queue.
  5. The client can then write to the queues the workers created in order to send data to the workers, and the workers write to the queue the client created in order to send data to the client.
Sending data worker-to-worker is not supported using these queues... but I don't know what would happen if the client were to write the received queues to the other workers.
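The steps above can be sketched with PollableDataQueue (one worker shown for brevity; the variable and function names are illustrative):

```matlab
% Sketch: two-way communication between the client and a parfeval worker.
toClient = parallel.pool.PollableDataQueue;   % step 1: client-owned queue

f = parfeval(@workerLoop, 1, toClient);

toWorker = poll(toClient, 60);                % step 4: receive the queue the
                                              % worker created in step 3
send(toWorker, 42);                           % step 5: client -> worker
out = fetchOutputs(f);                        % worker returns the item it got

function out = workerLoop(toClient)
    toWorker = parallel.pool.PollableDataQueue;  % step 3: worker-owned queue
    send(toClient, toWorker);                    % send it back to the client
    out = poll(toWorker, 60);                    % wait for work from client
end
```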
Update on the first-come-first-served approach to splitting the workload: if performance varies greatly from worker to worker (the case of GPU and CPU), it is very likely that your best-performing worker (the GPU) finishes its task, finds no more array elements to process, and has to wait for the slowest worker to finish. You may end up underutilising the fastest worker most of the time.
I saw big improvements with this approach on a computer whose CPU and GPU perform similarly. However, moving the code to another machine with a much better GPU, the performance was poorer than GPU only.
I think @Matt J's proposal, estimating a CPU-vs-GPU performance ratio and splitting the workload beforehand, would be a better approach, as it should make it possible to guarantee that the best-performing worker never idles.


Asked on 17 Jun 2021
Commented on 5 Jul 2021
