Why are pool workers going inactive while many iterations remain?

I have a task wrapped in a parfor loop in which text data files are read, converted to numeric values, downsampled to common signal frequencies, and saved out as .mat files. There are many thousands of these text files, varying in size between ~100 kilobytes and 3 gigabytes. This process runs on a workstation with an i9-10980XE (18 cores/36 threads, 128 GB RAM) using the default cluster settings. Upon initial launch of the script, I can see via the resource monitor that the processes for all 18 workers are consuming 100% CPU on each of their threads.
If I check back on the process several hours later, the majority of the workers have stopped contributing, with somewhere between 1 and 4 still running at 100%. All other worker processes still exist but are dormant, as confirmed both by continuous 0% CPU usage and by a reduction in the number of simultaneously changing files compared to the number seen at launch.
At this point there will still be hundreds if not thousands of files left to process, so I am confused as to why the available workers are not being utilized. I can see no evidence of any errors that would have forced the impacted workers to go dormant, and no indication that hardware resources became a limiting factor. If a worker were to stop mid-task, I would also see partial log files that were created but never completed, but this is not happening. It appears that most workers complete a small percentage of the overall task and then go dormant, while a smaller subset of the pool does the vast majority of the work.
I have seen manually setting the RangePartitionMethod and SubrangeSize pool parameters mentioned as a possible solution in other questions, but in those situations the issue appears to result from relatively few expected iterations per worker and inconsistent work per iteration. In my situation, given the considerably larger number of iterations compared to the pool size, I am assuming that the number of files and the distribution of file sizes are relatively consistent between workers.
UPDATE:
Based on Jeff's and Walter's answers I've modified my script, and it now utilizes all workers for the entire parfor execution.
I wrote a function that sets the work partitions such that each worker gets an initial batch which, combined across all workers, amounts to 50% of the total work. The remaining 50% is passed to workers as single iterations as they become available. This obviously creates some overhead versus pre-assigning, but in my application each execution on a worker is passed only a file name and is otherwise quite isolated with respect to I/O. At least in my case, any negative impact from the overhead is more than made up for by all of the workers being kept at close to 100% utilization.
opts = poolOpts(n);
parfor (i = 1:n, opts)
    % do stuff
end

function opts = poolOpts(iterations)
    pool = gcp('nocreate');                % handle to the current pool, if any
    if isempty(pool)
        pool = parpool;
    end
    nw = pool.NumWorkers;                  % number of workers in pool
    initChunk = floor(iterations/2);       % iterations to assign at start = 50%
    initWorkerChunk = floor(initChunk/nw); % initial iterations per worker
    % Vector of chunk sizes: nw equal chunks up front, then the remaining
    % iterations as single-iteration chunks handed out on demand.
    poolPartitions = [repmat(initWorkerChunk,1,nw) ones(1,iterations-initWorkerChunk*nw)];
    opts = parforOptions(pool,"RangePartitionMethod",@(iterations,nw) poolPartitions);
end

1 Comment

How many files are being processed in total and how many have been fully processed when you check back?


 Accepted Answer

What you would like is for any free processor to take up any waiting task (me too), but for some reason that's not how it works. Instead, parfor assigns all of the iterations to the different processors at the start, essentially making a little task queue for each processor. If too many slow tasks go into the queue for one processor (i.e., too many big files, in your case), that processor may still be chugging its way through its queue (with lots of unstarted tasks still in its queue) long after all of the other processors are finished and doing nothing. I think you have to use RangePartitionMethod to allocate tasks (files) more equally across processors.
Maybe there is something helpful in this question

6 Comments

parpool does not assign all of the iterations at the start.
parpool divides approximately 2/3 of the iterations evenly between processors at the start, putting them into a queue. After that, it takes about 1/4 of the iterations, divides them by the number of processors, and puts the resulting chunks into the queue. It then takes all of the remaining iterations and puts them into the queue one iteration at a time.
(I posted more accurate fractions and test code a number of years ago; might take me a bit of time to find those though.)
So, if all tasks were the same duration, then cores would typically be asked to execute a large chunk, then a small chunk, and then it would be a race for mop-up tasks.
But of course sometimes tasks can be very different duration. If the different lengths occur at "random" and the number of tasks is large enough, then on average it would all work out.
But... especially when working with files, it is common that different sized tasks determined by file contents tend to cluster around similar iteration numbers, so it can be the case that one of the workers eats through its tasks much faster than other workers. In such a case, the worker might end up taking on several of the small chunks... and possibly it might get through all of the single iterations too. And then it might not have anything left to do.
Once a chunk of iterations has been started by a core, parfor does not later say, "Oh, I see you have some iterations left and there are workers without iterations; do not do such-and-such-an-iteration, I will have a different worker do it."
iterations to be done are not just queued one-by-one: chunks sizes are determined at the beginning (but individual chunks are not assigned to workers until a worker asks for more work.)
Thank you Jeff and Walter. That does help to clear up my misunderstanding of the expected behavior. It sounds like what I am after is a way to reduce the portion of the iterations that are assigned as equal chunks at the start to maybe 50%, and then set all other chunks as single iterations to more fluidly assign between available workers? I am expecting the reduced initial chunk size to prevent a single worker from receiving drastically more work than the others, and the single-iteration chunks to better even out the work if there is still some imbalance.
To keep more of the workers active longer, would it be better to do something like this pseudocode? My cycle time is currently several days, so I'm hoping for a hint that I am on the right track before I refactor and test.
n = %number of files (iterations)
pool = gcp('nocreate'); %pool handle
nw = pool.NumWorkers; %number of workers in pool
initChunk = floor(n/2); %number of iterations to assign at start = 50%
initWorkerChunk = floor(initChunk/nw); %number of iterations per worker to assign at start
poolPartitions = [repmat(initWorkerChunk,1,nw) ones(1,(n-(initWorkerChunk*nw)))]; %vector of chunk sizes: nw equal chunks, then single iterations
opts = parforOptions(gcp,"RangePartitionMethod",@(n,nw) poolPartitions);
parfor(i = 1:n,opts)
%do stuff
end
The RangePartitionMethod approach looks likely to be what you should be using.
You could also consider switching from parfor to iterating parfeval() calls, queuing up one iteration at a time. The disadvantage is that parfeval requires variables to be passed explicitly instead of being transferred implicitly as determined by parfor's flow analysis, and that can lead to greater data transfer (though using parallel.pool.Constant can help reduce that.)
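A minimal sketch of that parfeval approach, assuming a cell array fileNames and a worker function processFile (both placeholders for your own data and code):

```matlab
% Sketch only: fileNames and processFile are placeholders. Each parfeval
% call queues exactly one file, so any idle worker picks up the next task
% as soon as it finishes its current one.
pool = gcp;
futures(1:numel(fileNames)) = parallel.FevalFuture;      % preallocate
for k = 1:numel(fileNames)
    % Request zero outputs; only the file name is transferred per task.
    futures(k) = parfeval(pool, @processFile, 0, fileNames{k});
end
wait(futures);   % block until every queued task completes
```

Because each task carries only a file name, the extra per-task transfer overhead should be small in your case.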
@Daniel Bengtson I really don't have enough experience with RangePartitionMethod to comment on your pseudo code, sorry.
Here's another approach you might consider, though. Say you have 36 processors..then have your parfor go from 1 to 36. Each individual task in the parfor is to process a list of files, and you set up your 36 lists in advance so that each list takes approximately the same amount of time to process (I'm assuming you can do that based on the sizes of the files that are to be processed). Surely if the number of iterations is the same as the number of processors, then parfor will just give one iteration to each processor (is that right @Walter Roberson ?)
I no longer recall exactly what happens when the number of parfor iterations matches the number of cores.
... but I would suggest that at that point it might make more logical sense to parfeval() instead of parfor.
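The list-per-worker idea above could be sketched like this, using file size on disk as a rough proxy for processing time (fileNames and processFile are again placeholders):

```matlab
% Sketch only: greedily assign each file (largest first) to the currently
% lightest list, so the nw lists take roughly equal total time if runtime
% scales with file size. fileNames and processFile are placeholders.
nw = gcp().NumWorkers;
fileBytes = cellfun(@(f) dir(f).bytes, fileNames);
[~, order] = sort(fileBytes, 'descend');
lists = cell(1, nw);          % one file list per parfor iteration
listBytes = zeros(1, nw);     % accumulated bytes per list
for k = order(:).'
    [~, j] = min(listBytes);  % pick the lightest list so far
    lists{j}{end+1} = fileNames{k};
    listBytes(j) = listBytes(j) + fileBytes(k);
end
parfor i = 1:nw
    for f = 1:numel(lists{i})
        processFile(lists{i}{f});   % the real per-file work goes here
    end
end
```

The drawback is that the balance is only as good as the size-to-runtime correlation; if one list runs long, there is no work stealing to compensate.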
@Daniel Bengtson Thanks for the update--that looks like a very useful example.



Release: R2021b
