What determines the increase in speed for parfor vs. for?

The question is simply that. I know that for loops with a lot of overhead relative to the work per iteration, parfor can be slower; in other cases it can be faster because iterations are computed in parallel.
The reason I ask is that my parfor loop runs much faster than I'd expect. When I use
tic
for run = 1:n_runs
function_to_call
end
toc
The code takes around 140-145 seconds to run for n_runs = 12. My parallel pool has 12 threads, which makes me think it should run about 12 times faster, plus or minus a bit to account for parfor overhead. What actually happens when I run
tic
parfor run = 1:n_runs
function_to_call
end
toc
is that the code takes about 1 second to run.
What is the explanation here? Am I missing something obvious, or is it something deeper?
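For reference, one way to make such timings comparable is to create the pool and do a warm-up pass before starting the clock, so one-off costs (pool startup, shipping code to the workers) are excluded. This is only a sketch; function_to_call stands in for the real function and the pool size mirrors the question:

```matlab
% Sketch: time for vs. parfor fairly by excluding one-off costs.
% Assumes function_to_call is on the path and Parallel Computing Toolbox
% is installed.
n_runs = 12;

p = gcp('nocreate');
if isempty(p)
    p = parpool(12);    % pool startup happens here, outside the timing
end

parfor run = 1:1        % warm-up: code is transferred to the workers once
    function_to_call
end

tic
parfor run = 1:n_runs
    function_to_call
end
t_par = toc;
fprintf('parfor: %.2f s for %d runs\n', t_par, n_runs);
```

If the 1-second result survives this kind of warmed-up timing, the speedup cannot be explained by measurement artifacts alone.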

7 Comments

Just a thought: what happens if you use profile instead, to see the rate-limiting steps in your function?
profile on
% your snippet
profile viewer
For the for loop, the profiler says that the code is spending almost all the time in the function which I'm calling. Within that, most of the time is spent evaluating if statements.
The profiler doesn't work as easily with parfor loops, as it just says that most of the time was spent evaluating "parallel function", and within that "java.util.concurrent.LinkedBlockingQueue".
Your problem seems to be "embarrassingly parallel", judging from the code snippets, which indicate there are n_runs independent calls to function_to_call without any coupling between them. If that's the case, you should be able to reduce it to n_runs = 1, right? If that's possible, you might get something from peppering the code with tic/toc to do a quick-and-dirty manual profiling. I'm well aware that's a poor method in general, but it might give some indication.
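A minimal sketch of that quick-and-dirty tic/toc profiling — the stage boundaries here are hypothetical placeholders, not the real structure of function_to_call:

```matlab
% Sketch: manual timing of one call (n_runs = 1) by bracketing stages
% with tic/toc. Replace the comments with the actual stages of
% function_to_call.
t0 = tic;
% ... stage 1 of function_to_call (e.g. generating the data) ...
fprintf('stage 1: %.3f s\n', toc(t0));

t1 = tic;
% ... stage 2 (e.g. the main processing loop) ...
fprintf('stage 2: %.3f s\n', toc(t1));
```

Crude, but it runs in a plain for loop, where the profiler output is also easiest to interpret.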
It gives some indication of where the time is being spent, but I already have a rough idea of that. What I'm curious to find out is what exactly made the parfor loop much more than 12x faster, given that there are only 12 workers.
@Bjorn, that will be true in general, but not always. It is also possible that the function pulls a filename off of some queue (which must be outside of Matlab, otherwise parfor probably doesn't work correctly). The situation where this function would interact with an external device doesn't seem suited for parallelization, but doing something based on a random stream is also not out of the question.
roughly what does function_to_call do? I could imagine some cases where Matlab might do some trickery only with parfor, but not for.
For example, if the effect of the function is redundant (overwriting the same variables or files), for might assume that you have some reason to do it in sequence, whereas parfor explicitly knows that order doesn't matter, so it can just skip to the last iteration. (A human could see this; I doubt Matlab could)
Perhaps more realistically, some aspect of memory management might be similarly sped up. ("oh, order doesn't matter, so I can load this array in once and share it to each thread" or "let's just keep this file stream open")
[caveat: I know just enough about parallelization to be dangerous; I expect you'll explain the function and I'll have no clue why it's faster, but someone else might]
Essentially what it's doing is running through a binary vector of length N (N > 1e5), creating subvectors of length n (n = 10-100) and counting how many 1s and 0s there are in each subvector. The binary vectors are randomly generated and I'm interested in averages, so it's run n_runs times as a Monte Carlo simulation to generate those averages.
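A minimal sketch of that computation, under assumptions about the details — the variable names, the non-overlapping window layout, and the 50/50 bit distribution are all made up for illustration and may differ from the real code:

```matlab
% Sketch: count 1s in length-n subvectors of a random binary vector,
% averaged over n_runs Monte Carlo repetitions. All names and the
% non-overlapping window layout are assumptions, not the asker's code.
N = 1e5; n = 50; n_runs = 12;
n_win = floor(N / n);
avg_ones = zeros(1, n_win);
for run = 1:n_runs
    v = rand(1, N) > 0.5;                 % random binary vector
    w = reshape(v(1:n*n_win), n, []);     % non-overlapping windows
    ones_per_win = sum(w, 1);             % count of 1s per subvector
    % the count of 0s per subvector is simply n - ones_per_win
    avg_ones = avg_ones + ones_per_win / n_runs;
end
```

Each repetition is independent of the others, which is what makes the loop a natural parfor candidate.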


Answers (0)

Release: R2020b

Asked: 8 Dec 2020

Commented: 15 Dec 2020
