Why are pool workers going inactive while many iterations remain?
I have a task wrapped in a parfor loop in which text data files are read, converted to numeric values, downsampled to common signal frequencies, and saved as .mat files. There are many thousands of these text files, ranging in size from ~100 kilobytes to 3 gigabytes. The process runs on a workstation with an i9-10980XE (18 cores/36 threads, 128 GB RAM) using the default cluster settings. When the script first launches, I can see in the resource monitor that the processes for all 18 workers are consuming 100% CPU for each of their threads.
If I check back several hours later, the majority of the workers have stopped contributing, with only 1 to 4 still running at 100%. All other workers still exist but are dormant, as confirmed both by continual 0% CPU usage and by a reduction in the number of simultaneously changing files compared to the number seen at launch.
At this point there are still hundreds if not thousands of files left to process, so I am confused as to why the available workers are not being utilized. I can see no evidence of any errors that would have forced the affected workers to go dormant, nor any indication that hardware resources became a limiting factor. If a worker were to somehow stop mid-task, I would also see partial log files created and not completed, but this is not happening. It appears that most workers complete a small percentage of the overall task and then go dormant, while a smaller subset of the pool does the vast majority of the work.
I have seen other questions mention manually setting the RangePartitionMethod and SubrangeSize pool parameters as a possible solution, but in those situations the issue appears to result from relatively few expected iterations per worker and inconsistent work per iteration. In my situation, given the considerably larger number of iterations compared to pool size, I am assuming that the number of files and the distribution of file sizes are relatively consistent between workers.
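For reference, the two parameters mentioned above are set through parforOptions and passed to the loop as its second argument. The following is a minimal sketch, assuming Parallel Computing Toolbox is available; the squared-index computation is only a placeholder for the per-file work, and the subrange size of 1 is an illustrative choice rather than a recommendation.

```matlab
% Reuse the current pool, or start one with the default profile.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool;
end

% 'fixed' partitioning: every subrange handed to a worker contains
% exactly SubrangeSize iterations (here, a single iteration).
opts = parforOptions(pool, 'RangePartitionMethod', 'fixed', 'SubrangeSize', 1);

n = 40;                      % placeholder iteration count
result = zeros(1, n);
parfor (i = 1:n, opts)
    result(i) = i^2;         % stand-in for reading and converting one file
end
```

Smaller subranges mean more scheduling round trips, so whether this helps depends on how expensive each iteration is compared to the dispatch overhead.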
Based on Jeff's and Walter's answers, I've modified my script, and it now utilizes all workers for the entire parfor execution.
I wrote a function that sets the work partitions so that each worker gets an initial batch; combined across all workers, these batches make up roughly 50% of the total work. The remaining 50% is passed to workers as single items once they become available. This obviously creates some overhead compared with pre-assigning, but in my application each execution on a worker is passed only a file name and is otherwise quite isolated with respect to IO. At least in my case, any negative impact from the overhead is more than made up for by all of the workers being kept at close to 100% utilization.
opts = poolOpts(n);
parfor (i = 1:n, opts)
    % do stuff
end

function opts = poolOpts(iterations)
    pool = gcp('nocreate'); % handle to the current pool, empty if none is running
    if isempty(pool)
        pool = parpool; % start a pool only when none exists
    end
    nw = pool.NumWorkers; % number of workers in pool
    initChunk = floor(iterations/2); % number of iterations to assign at start = 50%
    initWorkerChunk = floor(initChunk/nw); % number of iterations per worker to assign at start
    % vector of subrange sizes: one initial batch per worker, then single iterations
    poolPartitions = [repmat(initWorkerChunk,1,nw) ones(1,iterations-initWorkerChunk*nw)];
    % the handle receives (number of iterations, number of workers) and returns the sizes
    opts = parforOptions(pool,"RangePartitionMethod",@(n,nw) poolPartitions);
end
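To make the partition vector concrete, here is the same arithmetic worked through for an assumed 1000 files on 18 workers. These numbers are illustrative only and are not taken from the actual run. No pool is needed, since only the vector itself is built.

```matlab
iterations = 1000;   % assumed number of files (illustrative)
nw = 18;             % assumed number of workers (illustrative)

initChunk = floor(iterations/2);         % 500 iterations reserved for the initial batches
initWorkerChunk = floor(initChunk/nw);   % 27 iterations in each worker's initial batch

% 18 batches of 27 iterations, followed by 514 single-iteration subranges
poolPartitions = [repmat(initWorkerChunk, 1, nw), ...
                  ones(1, iterations - initWorkerChunk*nw)];

numSubranges = numel(poolPartitions);    % 18 + 514 = 532
totalCovered = sum(poolPartitions);      % equals iterations
```

Because of the two floor operations, the initial batches cover 486 iterations, slightly under half. One caveat: when iterations is less than 2*nw, initWorkerChunk becomes 0 and the vector starts with zero-length entries, so a guard for very small inputs may be worth adding.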
Jeff Miller on 28 Jul 2023
What you would like is for any free processor to take up any waiting task (me too), but for some reason that's not how it works. Instead, parfor assigns all of the iterations to the different processors at the start, essentially making a little task queue for each processor. If too many slow tasks go into the queue for one processor (i.e., too many big files, in your case), that processor may still be chugging its way through its queue (with lots of unstarted tasks still in its queue) long after all of the other processors are finished and doing nothing. I think you have to use RangePartitionMethod to allocate tasks (files) more equally across processors.
Maybe there is something helpful in this question
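The queueing effect described in this answer can be illustrated with a toy model. This is a sketch under assumed costs, not a reproduction of parfor's actual scheduler: it compares giving each worker one contiguous block up front against letting whichever worker is least loaded take the next single item. The cost values and worker count are made up for illustration.

```matlab
costs = [9 8 7 1 1 1 1 1 1 1 1 1];   % assumed per-file processing times, big files first
nw = 3;                              % assumed number of workers

% Static model: split the list into nw contiguous blocks, one queue per worker.
staticLoads = sum(reshape(costs, [], nw), 1);   % one column per worker

% Dynamic model: each item goes to the worker with the least work so far.
dynamicLoads = zeros(1, nw);
for k = 1:numel(costs)
    [~, w] = min(dynamicLoads);      % first least-loaded worker
    dynamicLoads(w) = dynamicLoads(w) + costs(k);
end

staticFinish = max(staticLoads);     % time until the slowest worker is done
dynamicFinish = max(dynamicLoads);
```

In this model the static split leaves one worker with all three large files while the other two finish early, which matches the pattern of a few busy workers and many idle ones.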