Efficient Way To Split Dataset Into Subsets

Hello,
I need to split a large dataset (DxN numeric array) into multiple subsets. I can use the code below (where groupIDs is an Nx1 matrix of integer IDs - the group to which each datapoint belongs).
groups = unique(groupIDs);
for i = 1:numel(groups)
    tempData = data(:,groupIDs==groups(i));
    % do work on tempData
end
However, 90% of the run time of the above code is spent just creating tempData! That amounts to over a minute every time I want to do this. Is there a more efficient way to split data by groupIDs? I tried splitapply() but it doesn't seem to be any faster.
Are there any MATLAB gurus out there who know a trick? Thanks!

5 Comments

how large is "large"?
500 x 3,000,000 (so a 12GB non-sparse double).
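For reference, the 12 GB figure follows from 8 bytes per double-precision element. A quick check, with variable names chosen here only for illustration:

```matlab
% Size of a dense double array with the dimensions quoted above.
D = 500;                 % rows (features)
N = 3e6;                 % columns (data points)
bytesPerDouble = 8;      % a MATLAB double occupies 8 bytes

totalGB = D * N * bytesPerDouble / 1e9;   % decimal gigabytes
fprintf('%.1f GB\n', totalGB);
```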
Greg on 24 Nov 2017
Edited: Greg on 24 Nov 2017
Use the second (or third? - I always have to guess and check between the two) output of unique(groupIDs).
Edit: This likely isn't faster, you still need a comparison check inside the loop. I always forget that part about the third output of unique.
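As a rough illustration of what the third output of unique provides, the sketch below uses a small made-up groupIDs vector. The use of accumarray to gather the column indices per group is an assumption added here, not something proposed in the thread. It avoids the per-group comparison, but indexing data with the result would still copy the selected columns.

```matlab
% Small stand-in for the real groupIDs (values are made up for illustration).
groupIDs = [7; 3; 7; 9; 3; 7];

% ic maps every element to its position in groups, so groups(ic) equals groupIDs.
[groups, ~, ic] = unique(groupIDs);

% Gather the column indices of each group in a single pass.
% sort(v) is applied because accumarray does not guarantee the order of v.
idxs = accumarray(ic(:), (1:numel(groupIDs))', [], @(v) {sort(v)});

% idxs{k} now lists the columns that belong to groups(k).
% data(:, idxs{k}) would still copy those columns out of data.
```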
12 GB? That is quite a lot. If this doesn't fit in memory, swapping to disk is the likely bottleneck ...
Thanks for the replies. I do have plenty of RAM left to spare, so it doesn't look like the hard drive is involved. Confirmed (re Greg) that using the output of unique is no better. For example, numeric indexing offers no improvement, and the indexing itself is not really the problem - it's probably the data copying:
disp('a. original (without "doing work")');
tic;
for i = 1:numel(groups)
    tempData = data(:,groupIDs==groups(i));
end
toc
disp('b. numeric indexing');
idxs = cell(numel(groups),1);   % one cell per group
for i = 1:numel(groups)
    idxs{i} = find(groupIDs==groups(i));   % precomputed outside the timed loop
end
tic;
for i = 1:numel(groups)
    tempData = data(:,idxs{i});
end
toc
disp('c. logical operation alone');
tic;
for i = 1:numel(groups)
    tempData = (groupIDs==groups(i));
end
toc
a. original (without "doing work")
Elapsed time is 4.590886 seconds.
b. numeric indexing
Elapsed time is 4.526391 seconds.
c. logical operation alone
Elapsed time is 0.066057 seconds.
There's gotta be another way - if I use a for loop with 3 million iterations it only takes 2 seconds longer.
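If the copy really is the dominant cost, one option to try is reordering the columns once so that each group sits in a contiguous block, then slicing by column range. The sketch below is untested on the real data and uses toy sizes. The names sortedData, starts and stops are made up for illustration. Each slice is still a copy, and the reordered array needs as much memory again as data, so whether this helps at 12 GB would have to be measured.

```matlab
% Toy data standing in for the real 500 x 3,000,000 array (sizes are made up).
data     = reshape(1:24, 4, 6);      % D = 4, N = 6
groupIDs = [7; 3; 7; 9; 3; 7];

% Reorder columns once so that each group occupies a contiguous block.
[sortedIDs, order] = sort(groupIDs);
sortedData = data(:, order);         % one full copy, paid a single time

% Block boundaries: positions where the sorted ID changes.
starts = [1; find(diff(sortedIDs) ~= 0) + 1];
stops  = [starts(2:end) - 1; numel(sortedIDs)];
groups = sortedIDs(starts);

for i = 1:numel(groups)
    tempData = sortedData(:, starts(i):stops(i));   % contiguous column range
    % do work on tempData
end
```

Comparing the loop with tic/toc against the original version on a subset of the data would show whether contiguous slices are any cheaper in practice.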


Answers (0)


Asked: E on 18 Nov 2017

Commented: E on 26 Nov 2017
