Internal function time increases with number of workers

When increasing parallelization there is typically a trade-off between distributing the computation and increasing communication overhead. In theory, the internal function time should stay constant: the I/O handling occurs before the function call, and the combining of data from across cores occurs after it.
However, I am seeing the internal function time increase as I parallelize on my machine. It appears that the degree of parallelization actually makes the individual function calls slower.
I made some example code to test this:
function test_parallel_timing()
    g = gcp;
    pools = 1:g.NumWorkers;
    mean_times = zeros(1,length(pools));
    for pp = 1:length(pools)
        num_pools = pools(pp);
        disp(' ');
        disp(['RUNNING ON ' num2str(num_pools) ' POOLS']);
        times = zeros(1,max(pools)); % preallocate for all iterations, not just num_pools
        parfor (ii = 1:max(pools), num_pools)
            times(ii) = pool_function;
        end
        mean_times(pp) = mean(times);
        disp(['Mean function time: ' num2str(mean(times))]);
    end
    figure
    plot(pools,mean_times);
    xlabel('Number of Pools');
    ylabel('Mean Computation Time (sec)');
end

function function_time = pool_function
    start_time = tic;
    tmp = toeplitz(1:2000)*toeplitz(1001:3000); %#ok<NASGU> % some costly computation
    function_time = toc(start_time);
    disp(['  Function took ' num2str(function_time) ' seconds']);
end
Running this produces a plot of mean function time versus number of pools, and the per-call time climbs as workers are added.
The timing is done entirely inside the function, which should capture the computation time without any of the parallelization overhead. If my timing is correct, the function calls themselves are getting slower as the number of workers grows. What could cause this?

Answers (1)

Interesting. The internal function time does seem to increase with the number of workers, BUT the total time to run the parfor loop does decrease. I'm not sure what happens behind the scenes of the MATLAB job scheduler (https://www.mathworks.com/help/distcomp/how-parallel-computing-products-run-a-job.html).
Perhaps a more appropriate way to measure the "observed function time" is to divide the total parfor loop time by the number of iterations. See the following code:
function test_parallel_timing()
    N = 400; % parfor iterations
    g = gcp;
    pools = 1:g.NumWorkers;
    mean_times = zeros(1,length(pools));
    total_times = zeros(1,length(pools));
    for num_pools = 1:length(pools) % iterate over all pool sizes, not a hardcoded 4
        fprintf('RUNNING ON %d POOLS\n', num_pools);
        times = zeros(1,N);
        a = tic;
        parfor (ii = 1:N, num_pools)
            times(ii) = pool_function;
        end
        total_times(num_pools) = toc(a);
        mean_times(num_pools) = mean(times);
        fprintf('Mean function time: %f\n\n', mean(times));
    end
    figure
    plot(1:length(pools), mean_times, 'r', 1:length(pools), total_times/N, 'g');
    xlabel('Number of Pools');
    ylabel('Mean (red) or Total/N (green) Computation Time (sec)');
end

function function_time = pool_function
    start_time = tic;
    tmp = toeplitz(1:500)*toeplitz(1:500); %#ok<NASGU> % costly computation
    function_time = toc(start_time);
end
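
A side check worth making (a minimal sketch, assuming R2016b or later where ticBytes/tocBytes are available): tally how much data actually moves between the client and the workers during the loop, so communication overhead can be separated from the in-function compute time.

```matlab
% Sketch: ticBytes/tocBytes bracket a parfor loop and report the bytes
% transferred to and from each worker in the current pool.
p = gcp;
ticBytes(p);
parfor ii = 1:100
    tmp = toeplitz(1:500)*toeplitz(1:500); %#ok<NASGU> % same costly computation
end
tocBytes(p)   % displays bytes sent to / received from each worker
```

If the transferred byte counts are small and flat across pool sizes, the slowdown is not communication but something on the workers themselves.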

4 Comments

The total time does go down, but not proportionately, as one would expect given I/O overhead. Using total parfor time / iterations includes the I/O overhead, whereas the internal time should not.
Keep in mind that this is just a toy example that does not remotely tax memory. For the large-scale data processing we are doing, we see increases of up to 50% in time per function execution.
I'd love to hear what thoughts MathWorks has on this.
I see. Is the total execution time less important than the internal function time for your application? I guess we shall see what MathWorks says, since the inner workings of their MJS are not public.
The total execution time is certainly the most important, but this shows that even if you make the I/O to the workers extremely efficient (slicing data, using parallel.pool.Constant where necessary, etc.), you still get degradation in per-call performance as you add cores. This adds an unknown factor to optimizing parallelization, which makes things difficult.
Hopefully MathWorks chimes in on this thread!
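
For reference, a minimal sketch of parallel.pool.Constant (available since R2015b), the efficiency technique mentioned above: it copies a large read-only input to each worker once, instead of the data being re-broadcast on every parfor run. The matrix and the sum here are arbitrary illustrations.

```matlab
% Sketch: broadcast a large read-only matrix to the workers once.
data = toeplitz(1:2000);            % some large read-only input
C = parallel.pool.Constant(data);   % one copy per worker, made once
out = zeros(1,100);
parfor ii = 1:100
    out(ii) = sum(C.Value(ii,:));   % workers read their local copy
end
```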
I had the same problem with my optimization task. On an HP server with a powerful Xeon Gold 6240 CPU, when I run my code without parallelization, every iteration takes 9 seconds at 57% CPU load. When I use parallelization with 12 workers, the CPU load drops to only 9% in total and every iteration takes far longer!
After some effort, I changed the number of threads from 1 to 8 in the local profile configuration, and the iteration time dropped to 2.7 seconds. But 2.7 seconds is still a lot for a CPU this powerful: on my PC with an Intel Core i7-4770, without any parallelization, each iteration takes only about 8 seconds.
I really couldn't find the root cause yet; maybe it is related to the overhead or the scheduler. In any case, this shows that configuration is very important, and the MathWorks documentation is not enough for a user to set up a machine to run at full speed.
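
One sketch of the thread setting described above: pool workers run with a single computational thread by default, while desktop MATLAB uses all cores, so serial and per-worker iteration times are not directly comparable. The functions parfevalOnAll and maxNumCompThreads are standard; the thread count of 4 is an arbitrary assumption for illustration.

```matlab
% Sketch: raise the BLAS/computational thread count on every worker in
% the current pool (workers default to a single thread).
p = gcp;
f = parfevalOnAll(p, @() maxNumCompThreads(4), 0);
wait(f);   % block until every worker has applied the setting
```

Oversubscription cuts both ways: workers x threads should not exceed the physical core count, or the threads contend with each other.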

Release: R2017a
Asked: 5 Jul 2018
Edited: 28 Nov 2019
