Parfor solving optimization problems (Cplex) slower than for

Hello,
I am trying to solve a bunch of optimization problems in parallel using MATLAB Parallel Computing Toolbox R2018b on my client (Win10) plus MATLAB Distributed Computing Server R2018b on my 3-node cluster (Win7) with 52 workers. These are rather small problems, but there are hundreds of them, so in theory parfor should be helpful in this case.
I read these problems from .lp files into a cell array and then solve them inside a parfor loop, as below:
% subp_array is a 1xn cell array of Cplex problems
nThreads = 1; % I don't see any time benefit of giving it more than 1 thread
parpool('MJSProfile1', nWorkers); % nWorkers is set elsewhere
totalTime = tic;
parfor subp_index = 1:length(subp_array)
    iterTime = tic;
    prob = subp_array{subp_index}; % assigning the slice to prob and working on it apparently speeds up calculations
    prob.Param.parallel.Cur = -1; % set parallel option to opportunistic
    prob.Param.threads.Cur = nThreads; % set number of threads per problem
    prob.Param.mip.tolerances.mipgap.Cur = 0.01;
    prob.solve();
    % get time of this iteration
    elapsedTime{subp_index} = toc(iterTime);
end
% get time of entire loop
elapsedTotalTime = toc(totalTime);
The problem is that this parfor loop with 10 problems on 16 workers runs for 32 s, compared to 1.5 s (sic!) for a regular for loop. Examining the timing results, the elapsed times of the individual iterations are very short, but the overall loop time is still large...
These are values of elapsedTime array:
{[0.0275]} {[0.0317]} {[0.0274]} {[0.0314]} {[0.0695]} {[0.4816]} {[0.0808]} {[0.0343]} {[0.0399]} {[0.0845]}
which is in total less than 1 second!
Is there anything in the syntax that may cause time delays? I am using sliced variables and assigning prob first so the sliced variable isn't indexed multiple times; no idea what else can be done... Apparently, if I run parfor with M = 0 (i.e. sequentially), I get the result immediately (the difference is especially visible with a few hundred problems). What could be making my parallel computation so slow?
Thanks in advance
Kasia

Answers (2)

Does the performance improve much / not much / not at all if you run the parfor loop a second time without closing the pool?
If the performance does improve a lot, then it's likely that the slow-down was caused by the parfor infrastructure having to work out that the code wasn't available, and attaching it to the pool. A message is printed when this occurs, or you can check the result of calling listAutoAttachedFiles:
listAutoAttachedFiles(gcp())
You can either live with that first-time slow-down, or attach the files up-front using addAttachedFiles.
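Attaching up-front is a one-liner against the current pool; a minimal sketch (the file names here are placeholders, not from the original post):

```matlab
% Attach known dependencies so the first parfor run doesn't pay the
% auto-attach cost. File names below are hypothetical examples.
pool = gcp();
addAttachedFiles(pool, {'solveSubproblem.m', 'helpers/'});
updateAttachedFiles(pool); % re-send if the attached files change later
```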
If the performance remains the same, perhaps the problem is the amount of data being transferred. Use ticBytes and tocBytes to investigate this. You could also experiment with stubbing-out most of the loop body. I.e. if you run a loop like this:
parfor subp_index = 1:length(subp_array)
    prob = subp_array{subp_index}; % data transfer only, no solve
end
how does that perform? That loop incurs the same amount of data transfer.

11 Comments

Thank you @Edric for quick answer!
Running the loop a second time on an already-alive pool does not make any difference, so that option is probably out.
I actually tried to measure the data transfer, but honestly it doesn't tell me a lot... One thing I realized today is that if I operate on the array directly, as below, the data transfer from the workers is much higher, and the loop takes ~120 s:
ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    %[...]
    subp_array{subp_index}.Param.parallel.Cur = -1; % set parallel option to opportunistic
    subp_array{subp_index}.Param.threads.Cur = nThreads; % set number of threads per problem
    subp_array{subp_index}.Param.mip.tolerances.mipgap.Cur = 0.01;
    subp_array{subp_index}.solve();
    %[...]
end
tocBytes(gcp);
%%%% results of tic-toc bytes:
% BytesSentToWorkers BytesReceivedFromWorkers
% __________________ ________________________
%
% 1 55246 93416
% 2 79582 1.0942e+05
% 3 21998 41128
% 4 2.6516e+05 3.3251e+05
% 5 9.2479e+05 1.207e+06
% 6 0 0
% 7 2.9081e+06 3.6206e+06
% 8 2.3895e+05 1.9847e+06
% 9 0 0
% 10 53502 78496
% 11 0 0
% 12 0 0
% 13 59758 86432
% 14 0 0
% 15 0 0
% 16 6.5977e+05 4.2877e+06
% Total 5.2668e+06 1.1841e+07
Now, if I assign the slice to a variable first, I get much less data transferred back from the workers:
ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    %[...]
    prob = subp_array{subp_index};
    prob.Param.parallel.Cur = -1; % set parallel option to opportunistic
    prob.Param.threads.Cur = nThreads; % set number of threads per problem
    prob.Param.mip.tolerances.mipgap.Cur = 0.01;
    prob.solve();
    %[...]
end
tocBytes(gcp);
%%%% results of tic-toc bytes:
% BytesSentToWorkers BytesReceivedFromWorkers (when assigned prob)
% __________________ ________________________
%
% 1 55239 14976
% 2 79575 14976
% 3 0 0
% 4 21991 14976
% 5 0 0
% 6 2.6515e+05 14976
% 7 9.2478e+05 14976
% 8 2.908e+06 18048
% 9 2.3894e+05 14976
% 10 0 0
% 11 53495 14976
% 12 59751 14976
% 13 0 0
% 14 0 0
% 15 6.5976e+05 14976
% 16 0 0
% Total 5.2667e+06 1.5283e+05
I also checked how much data is transferred when running only the assignment part of the code:
ticBytes(gcp);
parfor subp_index = 1:length(subp_array)
    prob = subp_array{subp_index};
end
tocBytes(gcp);
It accounts for most of the transfer, and probably most of the time as well (30.22 s):
% BytesSentToWorkers BytesReceivedFromWorkers
% __________________ ________________________
%
% 1 55225 13440
% 2 79561 13440
% 3 0 0
% 4 0 0
% 5 21977 13440
% 6 2.6514e+05 13440
% 7 0 0
% 8 0 0
% 9 9.2477e+05 13440
% 10 2.908e+06 13440
% 11 2.3893e+05 13440
% 12 53481 13440
% 13 0 0
% 14 59737 13440
% 15 6.5975e+05 13440
% 16 0 0
% Total 5.2666e+06 1.344e+05
So I guess it's all about the transfer! The question is, how to reduce it further?
Sadly, I no longer see any difference when playing with prob.Param.threads and prob.Param.parallel. It seemed to help before I changed the syntax to the assignment version above; now it gives the same results... Maybe that is a hint as well?
Another interesting thing I noticed: when running the parfor loop, even with 40 workers and 480 problems, CPU usage on the machines never exceeds 3% (although the CPU usage history shows that something is going on with these guys).
Hm, that's strange. It's not really a huge amount of data to transfer, even considering your workers are remote. When parfor sends data to workers, it has to do something equivalent to save and then load. Just to rule out any strangeness there, you could try profiling something like this, which should end up sending basically the same amount of data, but using trivial data types:
numRows = 100;
colsIn = 5e4;
colsOut = 3e2;
sliceIn = zeros(numRows, colsIn, 'uint8');
sliceOut = zeros(numRows, colsOut, 'uint8');
tb = ticBytes(gcp()); t = tic();
parfor idx = 1:numRows
    % repmat expands the scalar sum to a full row so the sliced-output
    % assignment sizes match (a bare scalar here errors inside parfor)
    sliceOut(idx,:) = repmat(sum(sliceIn(idx,:)), 1, colsOut);
end
tocBytes(gcp(), tb); toc(t);
The other thing to try is using the MATLAB profiler on your code running on the client, just to make sure there's no other unexpected time being taken on the client. (Note that you'll definitely see some chunks of client time spent in the parfor implementation where basically the client is just sitting there waiting for results from the workers - that's completely normal.)
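The client-side profiler workflow is just the standard profile on/viewer pair (the script name below is a placeholder for whatever contains your parfor loop):

```matlab
profile on
my_parfor_script   % placeholder: the script that runs the parfor loop
profile viewer     % opens the interactive report; check where client time goes
```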
What I get on the snippet as originally posted is "Unable to perform assignment because the size of the left side is 2-by-300 and the size of the right side is 2-by-1." Which is interesting, because shouldn't it actually be 1-by-300 and 1-by-1? :) The for loop works perfectly fine.
I wasn't aware of the MATLAB profiler (I've only been using this software for two weeks), so I'll get familiar with it! Thanks a lot Edric!
OK, I have run the profiler and this is what I got - a Java method is taking most of the time. I found a few other MathWorks topics with the same issue, but there was no clear solution... I additionally checked the maxNumCompThreads parameter on my nodes - I suspected it might be set to 1, preventing MATLAB from running more threads at the same time, but it returns the number of cores, so that looks OK.
Below are screenshots from the profiler:
That profiling result is useful because it confirms that all the time on the client is spent waiting for the workers to return their results. (The q.poll statement is where the client gets blocked waiting for the worker results to show up).
The next thing to try is using "mpiprofile" to see where the time is being spent on the workers. Unfortunately, "mpiprofile" is designed more for spmd workflows than parfor, but it does work. Here's how you'd use it:
spmd, mpiprofile('on'); end
... % Call your script containing parfor
spmd, mpiprofile('viewer'); end
This will show you a profile view for all the workers. You can look through that to see if there are any unexpected hot-spots. This could be compared to the view you get when running the code under the profiler using for instead of parfor.
OK, I'd seen the mpiprofile method already but couldn't figure out how to run it with parfor. Now I know :)
Here are the results. I limited the number of problems to solve to decrease the overall computation time, but we can still see that two functions stand out - distcompdeserialize and cplexlink1280:
I've already reduced a lot of the time used by cplexlink1280 by giving just 1 thread to each Cplex problem.
I drilled into distcompdeserialize...
...and then into deserialize -> line 29:
Another observation is that the distcompdeserialize time decreases when the parfor loop is run a second time on the same (alive) pool of workers. But the overall computation time stays the same...
There's definitely a chunk of time in distcompdeserialize - that's an internal PCT function that is used when transferring data from the client process to a MATLAB worker process.
However, looking at the absolute times - there's still a big chunk of time going somewhere. The total time taken by remoteParallelFunction (which is the worker-side wrapper for the body of a parfor loop) is only ~0.6 seconds, but (if I've understood correctly) the overall loop takes much longer. I don't really have any good way to explain that.
I would go back to trying to run a version of the parfor loop with the data transfer in place, but the actual computations stubbed out. My suspicion is that that will still take basically the same amount of time. This points to data transfer being the bottleneck - despite the actual number of bytes being transferred being relatively not that large...
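If the data transfer does turn out to be the bottleneck, one option is to send only file names to the workers and build each problem locally, so the Cplex objects never cross the client-worker link. A sketch, assuming the workers can read the .lp files from a shared path and that the CPLEX MATLAB API's readModel method is available (lpFiles is a hypothetical cell array of paths visible from the workers):

```matlab
% Transfer only strings; construct and solve each problem on the worker.
parfor subp_index = 1:numel(lpFiles)
    prob = Cplex();                      % construct on the worker
    prob.readModel(lpFiles{subp_index}); % read the .lp file locally
    prob.Param.threads.Cur = 1;
    prob.Param.mip.tolerances.mipgap.Cur = 0.01;
    prob.solve();
end
```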
Yup, the parfor loop with only the data transfer yields a similar time, also spent mostly in distcompdeserialize.
The entire parfor loop takes around 100 s, and I still have no idea where this time comes from... While executing, each Cplex problem prints a log, and you can see how slowly new logs appear. It just looks as if the workers were waiting for the next problem to pick up. And when running a large number of problems on all available cores, I noticed that Task Manager does not show any heavy computation - CPU usage is just a few percent, and the small per-core windows don't show any jumping green lines. Could it be something on the machine side that doesn't allow solving the problems in parallel, but instead puts them in a sequential queue?
Thanks a million again for giving me so many useful tips!


Egor Buiko
Egor Buiko on 14 Apr 2021
Edited: Egor Buiko on 14 Apr 2021
No need to go as far as optimization problems. Even this:
X = zeros(1,10^6);
parfor i = 1:10^6, X(i) = i; end
which, according to the documentation, is exactly the kind of loop the parallel pool was forged for, runs twice as slowly as a regular for loop.

Release

R2018b

Asked:

on 17 Mar 2020

Edited:

on 14 Apr 2021
