How can I modify this nested loop to work as a parfor without requiring a ridiculously large array?

I am trying to nest several for loops in a parfor loop. I have even read the documentation and several other queries/replies before posting here.
I have a large dataset that I need to iterate over, calculating a property. Writing each answer to a separate array element would produce a prohibitively large array, and doing the whole calculation on one CPU would take a prohibitively long time.
Instead, what I am doing is reducing the array by another property, and then combining the calculated results in bins of this second property (essentially making a histogram).
The parfor section of my code looks like this:
parfor i=1:box(1)
    for j=1:box(2)
        for k=1:box(3)
            for l=1:box(1)
                for m=1:box(2)
                    for n=1:box(3)
                        prop(horrendousfunction(i,j,k,l,m,n)) = ...
                            prop(horrendousfunction(i,j,k,l,m,n)) + data(i,j,k)*data(l,m,n);
                    end
                end
            end
        end
    end
end
Trialling this on one CPU over i=j=k=l=m=n=1:15 works fine and takes a few minutes.
The data array is the initial large array, but as written the code iterates over every element of it many times within each parfor step, so the data-transmission I/O overhead shouldn't be too onerous relative to the overall computation time. (Expected complete running time is ~1 month on 1 CPU, and I have several of these to do.)
A few other (small, negligible I/O) arrays are also required.
horrendousfunction is essentially a way of mapping (i,j,k,l,m,n) into a single index h which has been previously split into bins earlier in the code. I will eventually want to plot(h,prop).
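To make the structure of the calculation concrete, here is a minimal Python sketch (not the original MATLAB) of the binning idea: every pair of data points contributes a product to exactly one histogram bin. The function bin_of is a hypothetical, trivial stand-in for horrendousfunction.

```python
# Python sketch of the binning idea described above. bin_of is a toy
# stand-in for horrendousfunction: it maps a pair of linear indices
# (u, v) to a single histogram bin h.
def bin_of(u, v, nbins):
    # toy rule: bin by index separation, wrapped into nbins bins
    return abs(u - v) % nbins

def binned_pair_sums(data, nbins):
    # accumulate data[u]*data[v] for every pair into its bin
    prop = [0.0] * nbins
    for u in range(len(data)):
        for v in range(len(data)):
            prop[bin_of(u, v, nbins)] += data[u] * data[v]
    return prop
```

The point is that the output stays at N bins no matter how many pairs are processed, which is exactly why the behemoth intermediate array is unnecessary.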
Now, after spending some time trawling the documentation and various previous questions on the topic, I realise that my initial efforts contravene the basic parfor assumption that j, k, l, m, and n (as components of the index of prop) should be constant within the parfor loop. I see that if I wanted to flesh out a behemoth array A(i,j,k,l,m,n) I could do so by filling a dummy variable inside the loop and then combining it all with a sliced assignment like A(i,:,:,:,:,:) = dummy at the end - but this would create the stupidly large array that I wish to avoid (I don't have enough HDD to store it).
Simply calculating trick = horrendousfunction(i,j,k,l,m,n) within the loop is also not an option, because I can't use trick as the index of prop either, since its value changes within the loop too.
Predetermining the complete mapping of [i,j,k,l,m,n] -> h is also not an option, since that would require a behemoth data array of its own, and require said array to be I/O'd to all the matlabpool workers. (And calculating it would take about as long as the original problem anyway) Also, the mapping is context-dependent, so I can't use the same mapping across the related family of problems I wish to solve.
Is there another way I can write the internals of this loop to take advantage of MATLAB's parallelism without requiring a behemoth output array?

8 Comments

Is prop a 3-dimensional matrix? What's the output (and its size) of horrendousfunction?
prop is just a vector. Simple (N,1) recording of histogram values across N bins.
horrendousfunction just outputs a scalar, namely the bin which I want to add a value to for a given (i,j,k,l,m,n) subloop iteration. It just has that name because the operation takes up several lines' worth of space.
In an ideal world, the parallel processes would each total their own versions of prop, and then the returned results would be summed as they came in to give me the final value after all parfor iterations were complete.
How much of horrendousfunction can you pre-calculate without all of the inputs? For example if it involved 7*i^2 - 3*i*j - sin(m*pi) then you could break this up into a vector
mpi = sin((1:box(2))*pi);
and right within the "j" loop, calculate and store ijportion = 7*i^2 - 3*i*j. Then although you still need the values from all of the variables to do the full calculation, you could use ijportion + mpi(m) which would be much more efficient.
Unfortunately, at the fundamental level horrendousfunction is composed of things like abs(i-l), abs(j-m), and abs(k-n), along with various other constants.
I suppose I could reorder the loops to pair these and do as you suggest, making a small matrix for different separations ahead of time and then looking up the correct value of that matrix inside the innermost loop - but most of horrendousfunction would still need to be calculated inside the innermost loop, anyway, since it is generally a function of all three of the abs() functions. I'm not sure I'd save all that much time (but I'm willing to try it since it literally occurs billions of times).
Something like:
A = zeros(max(box));
for a = 1:max(box)
    for b = 1:max(box)
        A(a,b) = abs(a-b);
    end
end
proceeding to the parfor loop and then calling A(i,l) or A(j,m) or A(k,n) in the innermost loop.
In any case, the problem still boils down to being able to add something to one specific element of a vector during the parfor loop... I still can't figure how to define prop so that when each worker finishes it will bring the subtotal back to be added rather than overwriting the previous vector.
Also, because I'm trying to write a script for a general problem, I can't specify how large the abs(i-l) values will actually get beforehand. They may actually exceed abs(1-max(box)) if necessary (although this is unlikely). I am experimenting with setting a cap on that though, so that it can be increased/decreased as required on an individual basis.
I've also thought about performing the calculation on some data, and saving the resulting prop(h) file to disk, before trying another range... Performing the calculation sequentially kind of defeats the purpose of parallelising it, though, and also throws up possible bugs if the bin width is changed (currently nbins and range are user-set early in the .m file).
prop is just a vector. Simple (N,1) recording of histogram values across N bins.
But what are the typical values/magnitudes of N? And what are the typical values/magnitudes of box(1), box(2), box(3)? Presumably N is much smaller than B=prod(box)?
N is indeed smaller than prod(box), and depends on what the user sets it to be at the start of the script. I am thinking about 100 myself, so about 4000 times smaller.
You still haven't said what the typical magnitudes of box(i) will be.
Do you mean that N is typically 100 and therefore prod(box) is typically 4e5? So the box(i) are around 75? That doesn't sound like a challengingly large data set. Less than 2MB if data(i) are single precision floats. You could comfortably broadcast a separate copy of "data" to each lab very quickly and without straining memory.
The problem isn't distributing data(). You're right, that's currently about 4e5 elements - although I expect it will change once I implement the cutoff section properly (haven't described that here, it's something I'll work on once this works successfully and it might result in the second set of three loops being smaller or larger). data() is an array of doubles, btw, but still only a few MB. I can't see it causing that much overhead on its own.
Part of the problem is in the relationships between each element - there are ~ (4e5)^2=16e10 elements in that set (and potentially less/more once I get the final version working as described above).
The other issue is that the box vector elements themselves are dependent on the output of a third-party code and are not so far controllable... They depend heavily on the particular system being studied and a priori there is no way to tell how large or small they might be. Having said that, the two systems I've looked at so far have them ranging between 60 and 80 - but they are intrinsically related, with corresponding actual vector lengths and angles.


 Accepted Answer

I think I might use SPMD for this, instead of parfor, as below.
So the first modification to make is to abandon 6-dimensional indexing i,j,k,l,m,n. Replace subscripts (i,j,k) everywhere at the top level by a linear index u and (l,m,n) by a linear index v, so that your histogram update reduces simply to
prop(horrendousfunction(u,v)) = ...
prop(horrendousfunction(u,v)) + data(u)*data(v);
You can call ind2sub() inside horrendousfunction as needed with marginal overhead, since it's already horrendous.
The next step is to write a modified and vectorized version of horrendousfunction(). Namely you will write
h=horrendousVectorized(U,v)
to take a vector U of linear indices u and a scalar linear index v and return the appropriate vector of h indices, i.e., h(k)=horrendousfunction(U(k),v) for all k. This should be easy if, as you say, the calculation of h just involves abs() and simple arithmetic. All of those ops are nicely vectorizable for fixed v.
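The vectorization step above can be illustrated with a short Python sketch. The names horrendous_scalar and horrendous_vectorized are illustrative stand-ins, using a toy bin rule in place of the real calculation; the only point is that the vectorized form must agree elementwise with the scalar form.

```python
# Python sketch of the vectorization step: for a fixed v, map a whole
# vector of linear indices U to bin indices in one pass, instead of
# calling the scalar function once per inner-loop iteration.
def horrendous_scalar(u, v, nbins):
    return abs(u - v) % nbins          # toy stand-in for the real rule

def horrendous_vectorized(U, v, nbins):
    # must satisfy result[k] == horrendous_scalar(U[k], v, nbins)
    return [abs(u - v) % nbins for u in U]
```

In MATLAB the same thing falls out of elementwise operations like abs(U-v), with no explicit loop at all.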
Then it's all just,
B=prod(box);
p=zeros(N,1);
D=distributed(data(:));
U=distributed((1:B).');
for v=1:B
    datav=data(v);
    spmd
        h=horrendousVectorized(U,v);
        vals=D*datav;
        p=p+accumarray(h,vals,[N,1]);
    end
end
spmd
    p=gplus(p,1);
end
prop=gather(p{1});
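The workhorse of this approach is accumarray, which sums a vector of values into bins selected by a vector of subscripts. For readers unfamiliar with it, here is a minimal Python analogue (0-based indices, whereas MATLAB's accumarray is 1-based):

```python
# Python sketch of what accumarray(h, vals, [N,1]) does: sum vals[k]
# into bin subs[k] of an N-element accumulator (0-based here).
def accumarray(subs, vals, n):
    out = [0.0] * n
    for h, x in zip(subs, vals):
        out[h] += x
    return out
```

This is why the histogram update needs no behemoth intermediate: each pass collapses a whole vector of contributions straight into the N bins.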

11 Comments

If data is really 75^3, as you allude to in your Comments above, then this variation, which does use parfor, might also be something to consider.
B=prod(box);
prop=zeros(1,N);
for h=1:N
    I=cell(1,B);
    J=cell(1,B);
    parfor v=1:B
        I{v}=find(horrendousVectorized(1:B,v)==h).';
        J{v}=v*ones(size(I{v}));   % one column index per match
    end
    S=sparse([I{:}],[J{:}],true,B,B);
    prop(h)=data(:).'*S*data(:);
end
This one (the original SPMD code this comment is attached to) seems to work OK on a single CPU without the SPMD commands (or the gather at the end), but I can't easily tell whether it gives the same answer as my original nested loops, since it works over the entire data(i,j,k) for half the problem and over the first 1000 linearised data(prod(box),1) values, instead of looping over the first 1000 (i,j,k) points twice.
The good news is that there is a decent speedup over the outermost loop, partly (~5%) due to my rewriting of horrendousfunction thanks to random help page searching (I found the ceil and floor functions - very useful) and partly (~93%) due to Matt's tip that I should vectorise the code. If I can verify that it gives the same answer as the first method (which I know is mathematically correct) then I'll take the 50x speedup!!
Unfortunately, when I try to run it with SPMD it crashes, I get the following error:
Error using spmd_feval (line 8)
Error detected on lab(s) 1 2 3 4
Caused by:
Undefined function 'accumarray' for input arguments of type 'codistributed'.
Error stack:
(No remote error stack)
That, and other errors I have seen trying to use distributed data with accumarray seem to imply that they don't mix. Since the vals vector is calculated from the D array in Matt's code, it too is distributed.
If on the other hand, I take out the distributed() calls, and just use the native arrays: D=data(:); U=(1:B).'; it does run in parallel using SPMD... however, it runs about 4x slower than the single-CPU code. The test here was over i=1:1000, which ran for ~63s on one CPU and ~230s over 4. Trying it over i=1:4000 to give it more runtime to amortise I/O overhead showed ~261s for 1 CPU and ~923s for 4 CPUs - another slowdown by a factor of ~4 (see reply to Edric below).
Some other notes: I had previously been filtering results that had horrendous>N out via an if clause (slow, I know). Running as horrendousvectorised meant that the filter no longer worked (since it put all the horrendous values in at once). At first, I tried a simple filter horrendous(horrendous>N) = NaN; but it turns out that accumarray can't handle NaN inputs... so I then used horrendous(horrendous>N) = N + 1; and tweaked the appropriate predefinitions of array sizes to match. (Now I plot only the first N bins of the output)
I think I might have an idea where the 4x slowdown is coming from:
I don't intuitively see which of the lines are actually going to be parallelised using SPMD. If they are all running unparallelised, the SPMD block will just be spawning four copies of the same job.
I can't see how the prop = prop + accumarray line can possibly be parallelised the way it is currently written. What I think is happening is that one of the workers gets to that first, and then the others are locked out from modifying the value(s) of prop until the first worker is finished.
This results in the workers completing the same jobs in series, meaning we see a 4x runtime (with some overhead for distributing the data and writing back to the master copy).
I can't see what else is causing it.
If I'm reading the code with the sparse() call correctly, it assumes that there will be order prod(box) contributions to each bin value of h. This is not correct; the reason I'm iterating over all the values is that they're all non-zero. The problem with that is that if I was to write it all out as combinedprop(i,j,k,l,m,n) that would have ~ 1e11 non-zero elements and therefore be prohibitively large.
Fundamentally what I'm trying to do here is look at two sets of data for related systems, one where I expect there to be plenty of relationships significant enough to include in the histogram, and one where they are slightly more sparse (but not necessarily enough to actually call by that term).
The values in one set range over about (0,120), so their products are going to range over roughly (0,14400) - which means that I can't really afford to ignore any of them since the sheer number of low-valued products may sum to more than the "more significant" ones for a given bin in prop.
If I'm reading the code with the sparse() call correctly, it assumes that there will be order prod(box) contributions to each bin value of h.
Forget that proposal. I have a 3rd proposal below which should be more efficient and which reduces everything to a single parfor loop.
B=prod(box);
props=zeros(N,1);
parfor v=1:B
    h=horrendousVectorized(1:B,v);
    vals=data(:)*data(v);
    props=props+accumarray(h,vals,[N,1]);
end
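The reason this works is that parfor recognises props as a reduction variable: each worker accumulates its own private partial histogram, and the partials are summed when the workers finish. A Python sketch of that same pattern (names illustrative, toy bin rule, threads standing in for MATLAB workers):

```python
# Python sketch of the reduction pattern parfor applies to props:
# each worker builds a private partial histogram over its chunk of v
# values, then the partials are summed at the end.
from concurrent.futures import ThreadPoolExecutor

def partial_hist(data, vs, nbins):
    p = [0.0] * nbins
    for v in vs:
        for u in range(len(data)):
            p[abs(u - v) % nbins] += data[u] * data[v]  # toy bin rule
    return p

def reduced_hist(data, nbins, nworkers=4):
    B = len(data)
    # deal the v values out round-robin across workers
    chunks = [range(w, B, nworkers) for w in range(nworkers)]
    with ThreadPoolExecutor(max_workers=nworkers) as ex:
        partials = ex.map(partial_hist, [data] * nworkers,
                          chunks, [nbins] * nworkers)
    # reduction step: elementwise sum of the partial histograms
    return [sum(col) for col in zip(*partials)]
```

Because each worker only ever touches its own partial array, there is no contention on the shared histogram, which is what the earlier "workers locked out of prop" worry was about.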
I can't see how the prop = prop + accumarray line can possibly be parallelised the way it is currently written.
I think you mean the p=p+accumarray line? You should be able to debug the SPMD version by using the following modified lines, but never mind. I think the parfor version above (my 3rd proposal) will be more efficient.
h=horrendousVectorized(U,v);
h=gather(h,labindex);
vals=D*datav;
vals=gather(vals,labindex);
My mistake was not converting the distributed arrays h,vals to variant arrays local to each worker.
Success! This works a treat, thank you Matt.
I just changed the existing single-CPU working version to parfor, and it indeed shows a reduction in runtime. Performing v=1:1000 takes ~65s on 1 CPU and ~27s on 4 CPUs, cutting the runtime to roughly 40% of the serial time. I assume this will get even closer to 25% as I increase the range of v towards prod(box).
What was originally going to take a month is now down to about 6 hours on my laptop, and even quicker on a cluster :)
Now, to invoke periodic boundary conditions for the first index of h... (half done already)
Some further optimization that could have an impact,
data=data(:);
Brange=1:B;
props=zeros(N,1);
parfor v=1:B
    h=horrendousVectorized(Brange,v);
    vals=data*data(v);
    props=props+accumarray(h,vals,[N,1]);
end
In particular, I think broadcasting the vector Brange=1:B instead of regenerating it for each v could save something significant. Otherwise, you end up allocating B vectors of length B over the course of the loop - essentially the overhead of a BxB array.
Is this different from setting
U = (1:prod(box)).';
and then using
h = horrendous(U,v);
as before? To get that block working initially, all I did was remove the distributed() part of the line defining U...
Actually, I'm hitting another set of slowdowns now. The early version looped over the same box twice (because that was simple to code). Now I'm moving the v points over the same box, but the positions I'm sampling from each point are a different box again, centred on v.
The bonus is that I still only have to pass data() to the workers, since olddata(box) had periodic boundary conditions and I can map any point in the newdata(box2) to an equivalent point in the olddata(box).
The difficulty is that I need to calculate newdata() inside the parfor loop, because I have to shift everything depending on which element of olddata() is v. Of course, it's easy to hack another for loop in, but that's slow...
parfor v=(1:B);
    ...
    newdata = zeros(size(U)); % where U is now U = (1:prod(new)).';
    for w = U;
        newdata(w) = olddata(sub2ind(box,x1(j),y1(j),z1(j)));
    end
    rho1rho2 = charge*rho1;
    ...
end
LMAO. Actually, I had had 1:prod(new) in place of U in this section of the code. Took it out, much faster (~150 times), more comparable to the original code timing (about 4 times slower, understandable given the number of sub2ind calls I have to make in the inside loop). Confirmation of what we both knew already.
I am now considering writing (or searching Central for) a faster parser to perform the duties of sub2ind.
Is this different from setting U = (1:prod(box)).';
No. It's the same.
Of course, it's easy to hack another for loop in, but that's slow...
Not sure if you've shown the inner for-loop as intended. The right hand side of the assignment doesn't depend on w at all, but rather on some mysterious new j variable. If it's meant to be that way, you can get rid of the for-loop altogether merely by doing
newdata(U) = olddata(sub2ind(box,x1(j),y1(j),z1(j)));
If instead the for-loop was supposed to look like this,
for w = U;
    newdata(w) = olddata(sub2ind(box,x1(w),y1(w),z1(w)));
end
then you can still get rid of it by replacing it by the vectorized command
newdata(U) = olddata(sub2ind(box,x1(U),y1(U),z1(U)));
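As an aside, the index arithmetic that sub2ind performs is cheap enough to inline if the function-call overhead ever matters. Here is a Python sketch of the 3-D case (0-based, whereas MATLAB is 1-based and column-major, so the MATLAB formula carries the usual off-by-one adjustments):

```python
# Python sketch of the linear-index arithmetic behind sub2ind for a
# 3-D array: with 0-based subscripts and column-major ordering,
# linear = i + s0*(j + s1*k), where (s0, s1, s2) is the array shape.
def sub2ind3(shape, i, j, k):
    s0, s1, _ = shape
    return i + s0 * (j + s1 * k)
```

Inlining this formula directly (vectorized over U) is one way to write the "faster parser to perform the duties of sub2ind" mentioned above.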
You're right, I'd forgotten which index I was using. I knew there had to be a way of vectorising that command! Just couldn't see it.
Thanks again :)
I'll update the description with the new timing information once the first serious calculation is complete (it's running now).
Hmm... running the for loop for a test set took 89.483 seconds over 4 CPUs, but running the single-line vectorised command took 89.486 seconds over the same 4 CPUs. Not much difference there (and it probably comes from the other applications open on the laptop).


More Answers (1)

Does it work to do something like this:
prop = zeros(1, N);
parfor ...
...
tmp = zeros(1, N);
tmp(horrendousfunction(i,j,k,l,m,n)) = data(i,j,k)*data(l,m,n);
prop = prop + tmp;
end

3 Comments

Will data(i,j,k) be sliced in this case? Since j and k are not broadcast variables, it would seem not. Also data(l,m,n) looks like it won't be sliced.
I've tried this out, and while it does allow the parfor loop to execute, the resulting task takes longer - at least for a test set of results.
To be fair, the first test set was rather small, only from 1:10 in each loop, which takes ~9.8s on one CPU in my laptop. It took ~41s across all four. I figured this was due to I/O overhead with loading data(i,j,k). The data perfectly replicated the single CPU job (so there wasn't any issue with each parfor worker overwriting prop when it finished).
A second test, 1:20 in each loop, runs for ~10m on one CPU. Ran for ~40m across all four. I am wondering if there is a timing issue with writing prop out from each worker/iteration?
See comment in reply to Matt above regarding parallelisation of prop = prop + tmp {or accumarray()};.


Asked: on 13 Oct 2013
Commented: on 17 Oct 2013