Internal function time increases with number of workers

When increasing parallelization there is typically a trade-off between distributing the computation and increasing communication overhead. In theory, the internal function time should stay constant: the I/O handling occurs before the function call, and the combining of data from across cores occurs after it.
However, I am seeing the internal function time increase as I parallelize on my machine. It appears that the degree of parallelization actually makes the individual function calls slower.
I made some example code to test this:
function test_parallel_timing()
    g = gcp;
    pools = 1:g.NumWorkers;
    mean_times = zeros(1,length(pools));
    for pp = 1:length(pools)
        num_pools = pools(pp);
        disp(' ');
        disp(['RUNNING ON ' num2str(num_pools) ' POOLS']);
        times = zeros(1,max(pools)); % preallocate for all iterations, not just num_pools
        parfor (ii = 1:max(pools), num_pools)
            times(ii) = pool_function;
        end
        mean_times(pp) = mean(times);
        disp(['Mean function time: ' num2str(mean(times))]);
    end
    figure
    plot(pools,mean_times);
    xlabel('Number of Pools');
    ylabel('Mean Computation Time (sec)');
end

function function_time = pool_function
    start_time = tic;
    tmp = toeplitz(1:2000)*toeplitz(1001:3000); %#ok<NASGU> % some costly computation
    function_time = toc(start_time);
    disp(['  Function took ' num2str(function_time) ' seconds']);
end
Running this produces a plot of mean function time versus number of pools, and the per-call time climbs as workers are added.
The timing is done entirely inside the function, which should capture the computation time without any of the parallelization overhead. If my timing is correct, the function calls themselves are getting slower as the number of workers grows. What could cause this?

Answers (1)

Interesting. The internal function time does seem to increase with the number of workers, BUT the total time to run the parfor loop does decrease. I'm not sure what happens behind the scenes of the MATLAB job scheduler (https://www.mathworks.com/help/distcomp/how-parallel-computing-products-run-a-job.html).
Perhaps a more appropriate way to measure the "observed function time" is to divide the total parfor loop time by the number of iterations. See the following code:
function test_parallel_timing()
    N = 400; % parfor iterations
    g = gcp;
    pools = 1:g.NumWorkers;
    mean_times = zeros(1,length(pools));
    total_times = zeros(1,length(pools));
    for num_pools = 1:length(pools) % iterate over all pool sizes, not a hardcoded 4
        fprintf('RUNNING ON %d POOLS\n', num_pools);
        times = zeros(1,N);
        a = tic;
        parfor (ii = 1:N, num_pools)
            times(ii) = pool_function;
        end
        total_times(num_pools) = toc(a);
        mean_times(num_pools) = mean(times);
        fprintf('Mean function time: %f\n\n', mean(times));
    end
    figure
    plot(1:length(pools), mean_times, 'r', 1:length(pools), total_times/N, 'g');
    xlabel('Number of Pools');
    ylabel('Mean (red) or Total/N (green) Computation Time (sec)');
end

function function_time = pool_function
    start_time = tic;
    tmp = toeplitz(1:500)*toeplitz(1:500); %#ok<NASGU> % costly computation
    function_time = toc(start_time);
end
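
A side check worth making (a minimal sketch, assuming R2016b or later where ticBytes/tocBytes are available): tally how much data actually moves between the client and the workers during the loop, so communication overhead can be separated from the in-function compute time.

```matlab
% Sketch: ticBytes/tocBytes bracket a parfor loop and report the bytes
% transferred to and from each worker in the current pool.
p = gcp;
ticBytes(p);
parfor ii = 1:100
    tmp = toeplitz(1:500)*toeplitz(1:500); %#ok<NASGU> % same costly computation
end
tocBytes(p)   % displays bytes sent to / received from each worker
```

If the transferred byte counts are small and flat across pool sizes, the slowdown is not communication but something on the workers themselves.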

4 Comments

The total time does go down, but not proportionately, as one would expect given I/O overhead. Using total parfor time / iterations includes the I/O overhead, whereas the internal time should not.
Keep in mind that this is just a toy example that does not remotely tax memory. For the large-scale data processing we are doing, we see increases of up to 50% in time per function execution.
I'd love to hear what thoughts MathWorks has on this.
I see. Is the total execution time less important than the internal function time for your application? I guess we shall see what MathWorks says, since the inner workings of their MJS are not public.
The total execution time is certainly the most important, but this shows that even if you make the I/O to the workers extremely efficient (slicing data, using parallel.pool.Constant where necessary, etc.), you still get degradation in per-call performance as you add cores. This adds an unknown factor to optimizing parallelization, which makes things difficult.
Hopefully MathWorks chimes in on this thread!
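
For reference, a minimal sketch of parallel.pool.Constant (available since R2015b), the efficiency technique mentioned above: it copies a large read-only input to each worker once, instead of the data being re-broadcast on every parfor run. The matrix and the sum here are arbitrary illustrations.

```matlab
% Sketch: broadcast a large read-only matrix to the workers once.
data = toeplitz(1:2000);            % some large read-only input
C = parallel.pool.Constant(data);   % one copy per worker, made once
out = zeros(1,100);
parfor ii = 1:100
    out(ii) = sum(C.Value(ii,:));   % workers read their local copy
end
```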
I had the same problem with my optimization task. On an HP server with a powerful Xeon Gold 6240 CPU, when I run my code without parallelization, every iteration takes 9 seconds at 57% CPU load. When I use parallelization with 12 workers, the CPU load drops to only 9% in total and every iteration takes far longer!
After some effort, I changed the number of threads from 1 to 8 in the local profile configuration, and the iteration time dropped to 2.7 seconds. But 2.7 seconds is still a lot for a CPU this powerful: on my PC with an Intel Core i7-4770, without any parallelization, each iteration takes only about 8 seconds.
I really couldn't find the root cause yet; maybe it is related to the overhead or the scheduler. In any case, this shows that configuration is very important, and the MathWorks documentation is not enough for a user to set up a machine to run at full speed.
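
One sketch of the thread setting described above: pool workers run with a single computational thread by default, while desktop MATLAB uses all cores, so serial and per-worker iteration times are not directly comparable. The functions parfevalOnAll and maxNumCompThreads are standard; the thread count of 4 is an arbitrary assumption for illustration.

```matlab
% Sketch: raise the BLAS/computational thread count on every worker in
% the current pool (workers default to a single thread).
p = gcp;
f = parfevalOnAll(p, @() maxNumCompThreads(4), 0);
wait(f);   % block until every worker has applied the setting
```

Oversubscription cuts both ways: workers x threads should not exceed the physical core count, or the threads contend with each other.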

Release: R2017a
Asked: 5 Jul 2018
Edited: 28 Nov 2019
