parfeval inside a class method does not update a class property

I have a class with a very time consuming method heavyTask(). This method operates on each element of an array stored as a property. Since the execution on one element does not depend on others, I want to speed up the execution by using both local CPU and GPU in parallel. An abstraction of this class would be:
classdef myClass < matlab.mixin.Copyable
    properties
        items
        result
    end
    methods
        function self = myClass()
        end
        function execute(self, N)
            self.items = 1:N;
            f(1) = parfeval(@heavyTask, 0, self, false);
            f(2) = parfeval(@heavyTask, 0, self, true);
            fetchOutputs(f);
        end
        function heavyTask(self, gpu)
            while not(isempty(self.items))
                n = self.items(1);
                self.items(1) = [];
                if gpu
                    self.result(n) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
                else
                    self.result(n) = mean(real(eig(rand(1000))));
                end
            end
        end
    end
end
I use parfeval() to run two instances of heavyTask() in parallel: one uses gpuArray and the other does not. The workload is split through self.items, the list of array items not yet processed. heavyTask() checks this list, picks one element, and removes it from the list. Since I cannot predict how many array elements each worker will process, this first-come-first-served idea is the only approach I have come up with.
This is how I create the class and execute the method:
a = myClass;
a.execute(4);
a.result
Unfortunately, this is what I get:
a.result
ans =
[]
However, if I replace parfeval() and fetchOutputs() with heavyTask(self, true) I get the desired behaviour:
a.result
ans =
0.4959 0.4969 0.4891 0.4778
I have not found any question that has answered my issue. The closest match I have got is this, but it does not seem to address my problem.
Is this the expected behaviour? Is there any workaround I can implement in my class?
Many thanks in advance for your help!

 Accepted Answer

Couldn't you do something like this?
function execute(self, N)
    gd = gpuDevice;
    % Test execution times
    tic;
    result(1) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
    wait(gd);
    tgpu = toc;
    tic;
    result(2) = mean(real(eig(rand(1000))));
    tcpu = toc;
    T = floor(tcpu*(N-2)/(tcpu+tgpu));
    % Divide the remaining iterations
    parfor n = 3:N
        if n <= T  % GPU
            result(n) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
        else  % CPU
            result(n) = mean(real(eig(rand(1000))));
        end
    end
    self.result = result;
end

10 Comments

Yes and no. heavyTask() execution time can differ greatly between CPU and GPU, and to calculate T I need to wait for both to finish. This class is used in a bigger code that calls it thousands of times with different data, so waiting to calculate T on every iteration would slow it down. A workaround could be to calculate T only once, but the entire execution takes hours to days and I suspect the CPU load may change over time, making T inefficient.
to calculate T I need to wait for both to finish.
For that to make a significant difference, N and the ratio tgpu/tcpu would have to be pretty small. At such an extreme, it would only make sense to run all of the iterations on the GPU.
Not really. tgpu/tcpu is around 0.1 in my actual code. But it does make sense to split the workload between CPU and GPU because, despite tcpu being much larger than tgpu, heavyTask() can be executed in parallel on different CPU cores. That makes the "effective" ratio between GPU and CPU close to 1.
heavyTask() can be executed in parallel on different CPU cores. That makes the "effective" ratio between GPU and CPU close to 1.
If so, then you already know in advance that you should assign approximately half the parfor loop iterations to the CPU and half to the GPU (T=1/2).
If you still think you need a more optimal calculation of T, then you can revise the calculation of tcpu to use multiple workers,
tic;
parfor i=2:numWorkers+1
result(i)=mean(real(eig(rand(1000))));
end
tcpu=toc;
You can also use parfeval so that tcpu and tgpu are calculated in parallel. If you are right that the "effective" ratio between GPU and CPU is close to 1, then these pre-computations should take about the time of a single execution of heavyTask on the GPU.
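A sketch of that idea, assuming a running pool with at least two workers (one with GPU access); the elapsed times are measured on the workers themselves, so submission overhead is excluded:

```matlab
% Sketch: estimate tgpu and tcpu concurrently with parfeval.
fGpu = parfeval(@timedTask, 2, true);
fCpu = parfeval(@timedTask, 2, false);
[~, tgpu] = fetchOutputs(fGpu);
[~, tcpu] = fetchOutputs(fCpu);

function [r, t] = timedTask(gpu)
    tic;
    if gpu
        r = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
    else
        r = mean(real(eig(rand(1000))));
    end
    t = toc;   % elapsed time measured on the worker itself
end
```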
I think you are right and this should be the best approach. I realised that splitting the workload at run time causes the best-performing worker (usually the GPU) to finish all the available elements and then wait for the other workers to finish. If the GPU-to-CPU ratio is very small, the GPU idles many times, making the parallelisation slower than GPU only.
With T I can calculate the number of elements to send to each worker such that the GPU is always slightly more loaded than the other workers, so it never idles. I will try it.
However, I have an issue calculating T. When I first run heavyTask() on the GPU to estimate T, it is much slower than subsequent executions, leading to a wrong estimate of tgpu and T. Is there any reason for this? I do a reset(gpuDevice(1)) at the very beginning of execute() to free up the GPU memory, if that makes any difference.
And you've included wait(gd)?
No, should I? I have a gather() right before the toc, I thought I do not need any wait().
No, should I?
Since it was in the code I suggested to you, naturally I assumed it was in your code as well... In any case, I don't know whether gather() has the same synchronising effect as wait(), so you should probably try it. You might also try a trivial initial operation on the GPU, after your reset and before the tic...toc, in case the device needs to be warmed up or something.
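A possible shape for that warm-up (a sketch; the throwaway operation is illustrative, any small GPU computation should do):

```matlab
% Sketch: run a trivial GPU operation after the reset so driver/JIT
% initialisation is not counted in the timed section.
gd = gpuDevice;
reset(gd);
warmup = gather(sum(rand(100, 'gpuArray')));  %#ok<NASGU> throwaway op
wait(gd);                                     % ensure the device is idle
tic;
r = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
tgpu = toc;
```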
One other thing that occurs to me is that you should probably not gather() the result of your GPU computations after every iteration, since this may have undesirable overhead. You should probably leave everything on the GPU until all computations are done and then transfer it at the very end, by modifying the loop as below:
parfor n = 3:N
    if n <= T  % GPU
        result{n} = mean(real(eig(rand(1000, 'double', 'gpuArray'))));
    else  % CPU
        result{n} = mean(real(eig(rand(1000))));
    end
end
self.result = [gather([result{1:T}]), result{T+1:end}];
According to this wait() should not be necessary with gather(), so I would leave the code without it.
Performing a dummy GPU calculation beforehand keeps all GPU iterations at a consistent performance now, so that solves the issue! It seems a somewhat dirty way to warm the GPU up, but it works. Thanks for that!
I agree that gathering after each iteration may add some overhead, but unfortunately in my case I need to gather() and clear at each iteration since the results take up most of the GPU memory; otherwise I get:
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
You are quite welcome, but if you have a solution now to your original question, please Accept-click the answer.


More Answers (1)

You cannot broadcast handle objects to a parpool. They simply get cloned and used as independent class instances on the workers. If you rewrite your class as a value class and execute using value class semantics,
a=a.execute(4);
then it should work.
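A minimal sketch of what the value-class variant could look like. Note the class clones cannot share a live work list, so this version splits the indices up front (the 50/50 split and the class name are illustrative, not the asker's actual design):

```matlab
classdef myValueClass   % value class: no handle superclass
    properties
        result
    end
    methods
        function self = execute(self, N)
            half = floor(N/2);   % illustrative fixed split
            f(1) = parfeval(@myValueClass.heavyTask, 1, 1:half, true);     % GPU
            f(2) = parfeval(@myValueClass.heavyTask, 1, half+1:N, false);  % CPU
            out = fetchOutputs(f, 'UniformOutput', false);
            self.result = [out{:}];   % results come back in index order
        end
    end
    methods (Static)
        function r = heavyTask(idx, gpu)
            r = zeros(1, numel(idx));
            for k = 1:numel(idx)
                if gpu
                    r(k) = gather(mean(real(eig(rand(1000, 'double', 'gpuArray')))));
                else
                    r(k) = mean(real(eig(rand(1000))));
                end
            end
        end
    end
end
```

Used with value-class semantics: a = myValueClass; a = a.execute(4); a.result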

7 Comments

Thanks for your answer Matt!
I would rather keep away from a value class because the arrays I process and the results are huge dense matrices, so it might be better to avoid clones.
I could possibly work around this by not writing to properties in heavyTask() and instead just outputting the results and fetching them later. But I definitely need a way to keep track, across instances of heavyTask(), of the array elements that have already been processed. Is there any way to share information across workers at run time? Some sort of pseudo-global variable? The only way I see at the minute is writing to a common file on the HDD, but it sounds like a very dirty hack...
There is no supported way to share information across workers at run time, other than to use a parallel queue https://www.mathworks.com/help/parallel-computing/parallel.pool.dataqueue.html or pollable queue to send data to the controller and have the controller send it to the remaining workers.
There are, of course, other methods, such as using matfile() to write to a shared file, or taking advantage of operating-system shared-memory segments, or using Java to transmit between processes (which is what is used for parallel queues, or was; they have done some optimisation for some kinds of sharing, such as using SLI to share between NVIDIA GPUs).
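For reference, the basic worker-to-client direction looks roughly like this (a sketch; afterEach runs the callback on the client as messages arrive, and the function names are illustrative):

```matlab
% Sketch: workers report each finished item to the client via a DataQueue.
q = parallel.pool.DataQueue;
afterEach(q, @(n) fprintf('item %d done\n', n));  % runs on the client

f = parfeval(@workerTask, 0, q, 1:4);
wait(f);

function workerTask(q, items)
    for n = items
        % ... heavy work for item n would go here ...
        send(q, n);   % one-way: worker -> client only
    end
end
```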
I would rather keep away from a value class because the arrays I process and the results are huge dense matrices, so it might be better to avoid clones.
You cannot avoid cloning if you are sending data to a parpool; cloning is done regardless of handle/value status.
In cases where the data to be cloned is the same for all workers, there are some potential efficiencies.
I managed, using the parallel.pool.DataQueue functionality, to receive data from workers, but I failed to send updates to all workers. I used a combination of DataQueue to notify the client that a worker has finished an element (to update the list of remaining items), then labSend to send the next available item to that worker and labReceive to receive it in the worker's workspace. Unfortunately, it seems labSend and labReceive do not work in combination with parfeval.
My final solution has been to send the minimum necessary data to all workers and remove references to the class object in heavyTask(). I use a temporary binary file on the HDD to store the index of the next array element to be processed, and the workers increase that number by one every time they take a new item. I used fopen/fread/fwrite/fclose to access the file, since I only need to write a single byte, and tic/toc showed this to be about 10x faster than matfile for this purpose. This is not the cleanest solution or the one I wished for, but it is the best I have found and it works.
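A sketch of that shared-counter idea (the helper name is hypothetical, not the asker's actual code; note that plain fopen/fread/fwrite provides no locking, so two workers could in principle read the same index. The asker reports it worked in practice, but a queue-based dispatcher is the supported route):

```matlab
% Hypothetical helper: claim the next unprocessed index from a shared file.
% The file must be initialised once beforehand, e.g.:
%   fid = fopen(counterFile, 'w'); fwrite(fid, 1, 'uint32'); fclose(fid);
function n = nextItem(counterFile)
    fid = fopen(counterFile, 'r+');
    n = fread(fid, 1, 'uint32');    % index this worker should process
    frewind(fid);
    fwrite(fid, n + 1, 'uint32');   % reserve the following index
    fclose(fid);
end
```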
I clocked the execution: with the above approach, a.execute(8) takes 5 seconds to complete. heavyTask() without parfeval takes 43 seconds on the GPU and 46 on the CPU, so I consider this a huge speed-up despite the ugliness of the code.
I can upload the modified class if someone considers it appropriate.
labSend() and labReceive() are for spmd only.
DataQueue and PollableDataQueue are one-way objects. The way to send data back and forth is to:
  1. Have the client create a data queue before starting the workers.
  2. The workers inherit that data queue; when they write to it, the client can read what was written.
  3. In particular, each worker starts by creating a data queue of its own, and writes it to the queue it inherited.
  4. The client reads the worker-created queues sent through its own queue.
  5. The client can then write to the queues the workers created in order to send data to the workers, and the workers write to the queue the client created in order to send data to the client.
Sending data worker-to-worker is not supported using these queues... but I don't know what would happen if the client were to write the received queues to the other workers.
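The steps above can be sketched with PollableDataQueue (one worker shown for brevity; the variable and function names are illustrative):

```matlab
% Sketch: two-way communication between the client and a parfeval worker.
toClient = parallel.pool.PollableDataQueue;   % step 1: client-owned queue

f = parfeval(@workerLoop, 1, toClient);

toWorker = poll(toClient, 60);                % step 4: receive the queue the
                                              % worker created in step 3
send(toWorker, 42);                           % step 5: client -> worker
out = fetchOutputs(f);                        % worker returns the item it got

function out = workerLoop(toClient)
    toWorker = parallel.pool.PollableDataQueue;  % step 3: worker-owned queue
    send(toClient, toWorker);                    % send it back to the client
    out = poll(toWorker, 60);                    % wait for work from client
end
```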
Update on the first-come-first-served approach to splitting the workload: if performance varies greatly from worker to worker (the case of GPU and CPU), it is very likely that your best-performing worker (the GPU) finishes its task, finds no more array elements to process, and has to wait for the slowest worker to finish. You may end up underutilising the fastest worker most of the time.
I saw big improvements with this approach on a computer whose CPU and GPU perform similarly. However, moving the code to another machine with a much better GPU, the performance was poorer than GPU only.
I think @Matt J's proposal, estimating a CPU-vs-GPU performance ratio and splitting the workload beforehand, would be a better approach, as it should make it possible to guarantee that the best-performing worker never idles.


Asked on 17 Jun 2021
Commented on 5 Jul 2021
