Parfor solving optimization problems (Cplex) slower than for

Question

0 votes

Hello,

I am trying to solve a bunch of optimization problems in parallel using Matlab Parallel Toolbox 2018b on my client (Win10) + Matlab Distributed Server 2018b on my 3 node-cluster (Win7) with 52 workers. These are rather small problems, but there's hundreds of them so, theoretically, parfor should be helpful in this case.

I am reading these problems from .lp files into cell array and then I am solving them within parfor loop, as below:

% subp_array is 1xn cell array % with Cplex problems
nThreads = 1; % I don't see any time benefit of giving it more than 1 thread
parpool('MJSProfile1',nWorkers);
totalTime = tic;
parfor subp_index = 1:length(subp_array)
    iterTime = tic;
    
    prob = subp_array{subp_index};  % assigning subp_array{subp_index} to prob and working on it apparently speeds up calculations
   
    prob.Param.parallel.Cur = -1; % set parallel option to opportunistic
    prob.Param.threads.Cur = nThreads; % set number of threads per problem
    prob.Param.mip.tolerances.mipgap.Cur = 0.01;
    prob.solve();
    
    % get time of particular iteration
    elapsedTime{subp_index} = toc(iterTime);
end
% get time of entire loop
elapsedTotalTime = toc(totalTime); 

The problem is that this parfor loop with 10 problems on 16 workers runs for 32 sec comparing go 1.5 sec (sic!) of regular for loop. When examinating time results, it comes out that elapsed time of particular iterations are very short, but overall loop time is still large...

These are values of elapsedTime array:

  {[0.0275]}    {[0.0317]}    {[0.0274]}    {[0.0314]}    {[0.0695]}    {[0.4816]}    {[0.0808]}    {[0.0343]}    {[0.0399]}    {[0.0845]}

which is in total less than 1 second!

Is there anything in the syntax that may cause time delays? I am using sliced variables, assigning prob firstly not to call the variable multiple times, no idea what else can be done... Apparently, if I run parfor with M = 0 (sequential), it gets the result immediately (in particular the difference is visible for few hundreds of problems). What may cause my parallel computing so slow?

Thanks in advance

Kasia

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Edric Ellis on 17 Mar 2020

Open in MATLAB Online

1 vote

Does the performance improve much / not much / not at all if you run the parfor loop a second time without closing the pool?

If the performance does improve a lot, then it's likely that the slow-down was caused by the parfor infrastructure having to work out that the code wasn't available, and attaching it to the pool. A message is printed when this occurs, or you can check the result of calling listAutoAttachedFiles:

listAutoAttachedFiles(gcp())

You can either live with that first-time slow-down, or attach the files up-front using addAttachedFiles

If the performance remains the same, perhaps the problem is the amount of data being transferred. Use ticBytes and tocBytes to investigate this. You could also experiment with stubbing-out most of the loop body. I.e. if you run a loop like this:

parfor subp_index = 1:length(subp_array)
    prob = subp_array{subp_index};  % assigning subp_array{subp_index} to prob and working on it apparently speeds up calculations
end

how does that perform? That loop incurs the same amount of data transfer.

11 Comments
Show 9 older comments Hide 9 older comments

Katarzyna Furmanska on 18 Mar 2020

Open in MATLAB Online

Thank you @Edric for quick answer!

Running the loop second time on already alive pool does not bring any difference, so that option is probably out.

I actually tried to measure data transfer but honestly, it does not tell me a lot... One thing I realized today is that if I work on array, like below, the data transfer from workers is much higher and that brings ~120s of time:

ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    %[...]
    subp_array{subp_index}.Param.parallel.Cur = -1; % set parallel option to opportunistic
    subp_array{subp_index}.Param.threads.Cur = nThreads; % set number of threads per problem
    subp_array{subp_index}.Param.mip.tolerances.mipgap.Cur = 0.01;
    subp_array{subp_index}.solve();
    %[...]
end
tocBytes(gcp);
%%%% results of tic-toc bytes:
%              BytesSentToWorkers    BytesReceivedFromWorkers 
%              __________________    ________________________
% 
%     1                 55246                    93416       
%     2                 79582               1.0942e+05       
%     3                 21998                    41128       
%     4            2.6516e+05               3.3251e+05       
%     5            9.2479e+05                1.207e+06       
%     6                     0                        0       
%     7            2.9081e+06               3.6206e+06       
%     8            2.3895e+05               1.9847e+06       
%     9                     0                        0       
%     10                53502                    78496       
%     11                    0                        0       
%     12                    0                        0       
%     13                59758                    86432       
%     14                    0                        0       
%     15                    0                        0       
%     16           6.5977e+05               4.2877e+06       
%     Total        5.2668e+06               1.1841e+07   

Now, if I assign the slice to the variable first, I am getting less data transfer from worker:

ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    %[...]
    prob = subp_array{subp_index}; 
    prob.Param.parallel.Cur = -1; % set parallel option to opportunistic
    prob.Param.threads.Cur = nThreads; % set number of threads per problem
    prob.Param.mip.tolerances.mipgap.Cur = 0.01;
    prob.solve();
    %[...]
end
tocBytes(gcp);
%%%% results of tic-toc bytes:
%              BytesSentToWorkers    BytesReceivedFromWorkers (when assigned prob)
%              __________________    ________________________
% 
%     1                 55239                    14976       
%     2                 79575                    14976       
%     3                     0                        0       
%     4                 21991                    14976       
%     5                     0                        0       
%     6            2.6515e+05                    14976       
%     7            9.2478e+05                    14976       
%     8             2.908e+06                    18048       
%     9            2.3894e+05                    14976       
%     10                    0                        0       
%     11                53495                    14976       
%     12                59751                    14976       
%     13                    0                        0       
%     14                    0                        0       
%     15           6.5976e+05                    14976       
%     16                    0                        0       
%     Total        5.2667e+06               1.5283e+05  

I checked also how much data is transfer when running assignment piece of code:

ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    prob = subp_array{subp_index}; 
end
tocBytes(gcp);

And it takes most of the transfer and of the time probably (30.22s):

%              BytesSentToWorkers    BytesReceivedFromWorkers
%              __________________    ________________________
% 
%     1                 55225                   13440        
%     2                 79561                   13440        
%     3                     0                       0        
%     4                     0                       0        
%     5                 21977                   13440        
%     6            2.6514e+05                   13440        
%     7                     0                       0        
%     8                     0                       0        
%     9            9.2477e+05                   13440        
%     10            2.908e+06                   13440        
%     11           2.3893e+05                   13440        
%     12                53481                   13440        
%     13                    0                       0        
%     14                59737                   13440        
%     15           6.5975e+05                   13440        
%     16                    0                       0        
%     Total        5.2666e+06               1.344e+05   

So I guess it's all about the transfer! The questions is, how to reduce it more?

Sadly, I don't see the difference now when playing with prob.Param.threads and prob.Param.parallel. It seemed to help before I changed this syntax with assignment above, now it brings same results... Maybe that would be kind of a tip as well?

Edric Ellis on 25 Mar 2020

There's definitely a chunk of time in distcompdeserialize - that's an internal PCT function that is used when transferring data from the client process to a MATLAB worker process.

However, looking at the absolute times - there's still a big chunk of time going somewhere. The total time taken by remoteParallelFunction (which is the worker-side wrapper for the body of a parfor loop) is only ~0.6 seconds, but (if I've understood correctly) the overall loop takes much longer. I don't really have any good way to explain that.

I would go back to trying to run a version of the parfor loop with the data transfer in place, but the actual computations stubbed out. My suspicion is that that will still take basically the same amount of time. This points to data transfer being the bottleneck - despite the actual number of bytes being transferred being relatively not that large...

Katarzyna Furmanska on 25 Mar 2020

Yup, the results of parfor loop with only data transfer yields similar time, also dedicated mostly to distcompdeserialize.

The entire parfor loop takes around 100s, and either I have no idea where this time comes from... When being executed, each Cplex problem prints a log and you can see how slowly new logs are appearing. It just looks like workers would wait for the next problem to catch. And when running large number of problems on all available cores, I noticed that task manager does not show any heavy calculations - CPU usage is just a few percent, the small windows indicating cores do not show any jumping green lines. Could it be on the machine site that it does not allow for solving the problems in parallel, but rather puts in in sequential queue?

Thanks a milion again for giving me so many useful tips!

Sign in to comment.

Answer 2

Egor Buiko on 14 Apr 2021

Edited: Egor Buiko on 14 Apr 2021

0 votes

No need go to far for optimization problem solving. Even for :

X=zeros(1,10^6);

Parfor i=1:10^6, X(i) = i; End

Which by documentation forged for parallel pool, works two times slower then regular for.

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Parfor solving optimization problems (Cplex) slower than for

0 Comments
Show -2 older comments Hide -2 older comments

Answers (2)

11 Comments
Show 9 older comments Hide 9 older comments

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

Community Treasure Hunt

Parfor solving optimization problems (Cplex) slower than for

0 Comments Show -2 older comments Hide -2 older comments

Answers (2)

11 Comments Show 9 older comments Hide 9 older comments

0 Comments Show -2 older comments Hide -2 older comments

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

11 Comments
Show 9 older comments Hide 9 older comments

0 Comments
Show -2 older comments Hide -2 older comments