Parallelization: data organization/transfer
I've got Nf files recorded from Nu field units, each containing Nd data samples from Ns sensors. Since each recording is a different duration, I've been storing them in a larger struct array "DM" using a ".val" container and pre-loading it into RAM for speed. I also save "DM" as a .mat file for easier/quicker loading later.
Within a double for loop, I pull "data" from the ".val" sub-container. The main function takes "data" as an input, calculates many (Np) parameters for each sample, and spits back a parameter array, or "values matrix" VM, which is then saved/aggregated into a larger "values matrix array" (VMA) structured similarly to "DM". There are other inputs, like unit calibration factors and threshold constants, that get passed to mainfunction, and other outputs similar to VMA, but this simplified section hopefully gets things across:
% <section to pre-allocate VMA here>
for ui = 1:Nu
    for fi = 1:Nf
        data = DM(ui,fi).val;      % size(data) = [Nd,Ns]
        VM = mainfunction(data);   % size(VM) = [Nd,Np]
        VMA(ui,fi).val = VM;
    end
end
I've only ever pegged 20-30% CPU utilization, so I was hoping to spin each "ui" batch of Nf recordings off to a separate (local) worker. parfor does technically work, but it struggles to keep all CPUs utilized and blows out the RAM needed, so there's a lot of copying going on, and the order everything runs in is definitely a bit random. I'm not against using it, but it seems to lack the control knobs needed to be efficient here.
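For concreteness, the parfor attempt was roughly the loop above with the outer for swapped for parfor (a sketch; in this form parfor slices DM over ui, so each worker should receive only its own rows):
parfor ui = 1:Nu
    for fi = 1:Nf                  % Nf must be a broadcast constant here
        data = DM(ui,fi).val;      % DM is sliced over ui
        VMA(ui,fi).val = mainfunction(data);
    end
end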
spmd seems like maybe a better way to control things: I can make "DM" a constant, get rid of the "for ui" loop, and use data = DM(labindex,fi).val; but I can't see how to save all my "VM" results back to the client. spmd can't aggregate that VM array back into VMA because of the ".val" method, so I'm looking for a better way to structure things for spmd (or, perhaps, parfeval) to work.
Also, RAM usage. This is maybe on the edge of "big data". Once I have "DM" loaded and "VMA" pre-allocated, I'm sitting near 110 GB of RAM utilized. Single-threaded, I didn't need much more than that and was able to just store the results into their slot in VMA with only a minimal extra amount of RAM needed for mainfunction. So far, though, my parallelization efforts have required a ton more RAM behind the scenes, even trying DM as a "parallel.pool.Constant". It'd be nice to limit usage to something close to that 110 GB. I did try making DM distributed, but spmd can't use the .val way of extracting data from a distributed "DM", and it has the same limitations on communicating VMA results back to the client.
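For reference, the parallel.pool.Constant attempt was along these lines (a sketch; note the Constant holds a full copy of DM on every worker, which is likely where the extra RAM went):
DMc = parallel.pool.Constant(DM);    % one full copy of DM per worker
parfor ui = 1:Nu
    for fi = 1:Nf
        data = DMc.Value(ui,fi).val; % indexing the Constant defeats slicing
        VMA(ui,fi).val = mainfunction(data);
    end
end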
I'm looking for a way for each worker to pull from a different section of a common source dataset, and then write back a larger batch of calculated parameters to a different section of a common results dataset. That seems simple enough, so perhaps it's my data structure that is getting in the way? None of the workers need to communicate anything besides the results back to the client, so this seems like a rather vanilla parallelization effort, I just can't quite see the correct clearing in this forest yet.
Answers (2)
Edric Ellis
on 7 Dec 2023
parfor is designed to "just work" for a wide range of cases, but it's true that sometimes you need a bit more control. parforOptions is one approach if you want to be in charge of partitioning the work, though I suspect that will not help you here. You've also tried parallel.pool.Constant, which gives a degree more control over data transfer, but might actually get in the way of partitioning the data here. (In parfor, your code will "slice" DM, which means that only the relevant portions of DM are sent to each worker - i.e. each worker does not get a full copy of DM.)
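For completeness, a minimal sketch of the parforOptions knob (assuming a local pool; "SubrangeSize" of 1 hands each worker exactly one ui batch at a time):
pool = gcp;
opts = parforOptions(pool, "RangePartitionMethod", "fixed", ...
                     "SubrangeSize", 1);   % one ui batch per subrange
parfor (ui = 1:Nu, opts)
    for fi = 1:Nf
        VMA(ui,fi).val = mainfunction(DM(ui,fi).val);
    end
end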
Even though your serial implementation uses only ~25% CPU, this doesn't necessarily mean that you can speed things up using parallelisation. Some algorithms are "memory bound" (rather than "compute bound"), which means that the limiting factor is the speed at which your system can move data between main memory and the CPU. It's frustratingly hard to tell when you're in this situation, but one simple experiment is to launch two MATLAB clients at the same time, each running your serial code. (This assumes you can do that without having to page memory to disk using swap space...) If both are able to run at normal speed without interfering with each other, then it's a fair bet that you can get benefit from parallelisation.
From what you've said, I suspect the key to making things work efficiently is to ensure that only the process operating on a given portion of DM actually loads that data into memory. Most likely this means not loading it at the client and transferring it to the workers; rather, you want to arrange for the workers to load the data they need directly. How much this affects things depends on how many workers you're running.
One way to do that is to use spmd, a bit like this:
spmd
    % Ignoring the case where Nu isn't evenly divisible...
    nuPerWorker = Nu / numlabs;
    uStart = 1 + (labindex-1) * nuPerWorker;
    uStop = uStart + nuPerWorker - 1;
    myDM = loadRangeOfDM(uStart, uStop);
    myVMA = preallocateVMA();
    for ui = 1:nuPerWorker
        % operate on myDM(ui,...)
    end
end
% Collect results from the workers using Composite indexing
VMA = vertcat(myVMA{:});
This avoids loading DM on the client; it does duplicate VMA in memory at the end when retrieving the results.
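Since you also mentioned parfeval, here's a sketch of that route under the same assumption that each task loads its own slice from disk (processUnit is a hypothetical helper, and loadRangeOfDM is the same hypothetical loader as above):
pool = gcp;
F(1:Nu) = parallel.FevalFuture;
for ui = 1:Nu
    F(ui) = parfeval(pool, @processUnit, 1, ui);
end
VMA = struct('val', cell(Nu, Nf));   % preallocate on the client
for k = 1:Nu
    [ui, vmRow] = fetchNext(F);      % collect rows as they finish
    VMA(ui,:) = vmRow;
end

function vmRow = processUnit(ui)
    dmRow = loadRangeOfDM(ui, ui);   % hypothetical: load one unit's files
    vmRow = struct('val', cell(1, numel(dmRow)));
    for fi = 1:numel(dmRow)
        vmRow(fi).val = mainfunction(dmRow(fi).val);
    end
end
Because fetchNext returns results as each unit completes, the client never loads DM at all, and its peak RAM is roughly VMA plus one row of results in flight.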