Efficient script to isolate one sub-dataset k-times.

Hi everyone,
The idea is to divide the main dataset into k bins, then delete one bin at a time and merge the remaining bins. In a nutshell, k bins will create k different sub-datasets. Since the number of rows in the matrix may not be a multiple of the number of bins (bin k often has fewer rows), I had to use cell arrays.
Here is an illustration of the general idea for k = 2.
Question:
How can I remove the loop or make this code more efficient?
Here is my script.
------------------------------------------------------
Variables = rand(245,57);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
for i = 1:length(Bin_size)-1
    if i == 1
        Bin_Variables2{1} = Variables(Bin_size(2):Bin_size(end),:);
    else
        Bin_Variables2{i} = [Variables(Bin_size(1):Bin_size(i)-1,:); Variables(Bin_size(i+1):Bin_size(end),:)];
    end
end
Thanks for your input.
  2 Comments
Voss on 5 Mar 2024
Edited: Voss on 5 Mar 2024
Two observations:
  1. The last row of Variables is included as the last row of every element of Bin_Variables2 (because Bin_size(end) is always included).
  2. When size(Variables,1) is a multiple of Bin_numb, I expect you'd want each element of Bin_Variables2 to be the same size, but that's not what happens.
To illustrate:
Variables = rand(242,7);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
for i = 1:length(Bin_size)-1
    if i == 1
        Bin_Variables2{1} = Variables(Bin_size(2):Bin_size(end),:);
    else
        Bin_Variables2{i} = [Variables(Bin_size(1):Bin_size(i)-1,:); Variables(Bin_size(i+1):Bin_size(end),:)];
    end
end
Observation 1: last row always the same:
fprintf('%36s%s\n','Last row of Variables: ',sprintf('%6.4g ',Variables(end,:)));
Last row of Variables: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
for ii = 1:numel(Bin_Variables2)
    fprintf('%36s%s\n',sprintf('Last row of Bin_Variables2{%d}: ',ii),sprintf('%6.4g ',Bin_Variables2{ii}(end,:)));
end
Last row of Bin_Variables2{1}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{2}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{3}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{4}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{5}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{6}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{7}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{8}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{9}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{10}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{11}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Observation 2: unequally sized result matrices even though 242 is a multiple of 11:
bin_sizes = cellfun(@(x)size(x,1),Bin_Variables2)
bin_sizes = 1×11
220 220 220 220 220 220 220 220 220 220 221
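Both observations can be traced back to the edge vector itself. The following is only a minimal sketch for the same 242-row example; the names N, step, edges, edges_half_open and removed are illustrative and do not appear in the original script, and the half-open convention is one possible fix, not necessarily the one intended.

```matlab
N = 242;                      % number of rows, as in the example above
Bin_numb = 11;
step = floor(N/Bin_numb);     % 22 rows per bin

% Edge vector as built in the original script: the final edge is N itself.
edges = [1:step:N, N];
% The loop keeps rows edges(i+1):edges(end), so row N is always kept,
% and leaving out the last bin only removes rows 221 to 241.
last_two_edges = edges(end-1:end);     % [221 242]

% Hypothetical alternative: treat bin i as the half-open range
% edges(i) : edges(i+1)-1 and let the final edge be N+1.
edges_half_open = [1:step:N, N+1];
removed = diff(edges_half_open);       % rows removed when bin i is left out
disp(removed)
```

With N = 242 every entry of removed is 22. When N is not a multiple of Bin_numb, this particular construction yields one extra, smaller bin at the end, so the remainder still has to be handled separately.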
Vic on 7 Mar 2024
@Voss Thanks for these observations. @Manikanta Aditya & @Dyuman Joshi Thanks for your help. I hadn't thought of using a logical array. This is an elegant way to solve it.
Here is my current script.
Variables = rand(245,7);
Bin_numb = 11;
Bin_size = 1:floor(length(Variables)/Bin_numb):length(Variables);
if length(Variables)-Bin_size(end) <= 12
    Bin_size(end) = length(Variables);
end
Bin_Variables2 = cell(1, length(Bin_size)-1);
for i = 1:length(Bin_size)-1
    idx = true(length(Variables), 1);
    idx(Bin_size(i):Bin_size(i+1)) = false;
    Bin_Variables2{i} = Variables(idx, :);
end
for ii = 1:numel(Bin_Variables2)
    fprintf('%1s%s\n',sprintf('Last row {%d}: ',ii),sprintf('%6.4g ',Bin_Variables2{ii}(end,:)));
end
bin_sizes = cellfun(@(x)size(x,1),Bin_Variables2)
length(Variables)-bin_sizes
Bin_size
I added an if condition that sets Bin_size(end) = length(Variables) when size(Variables,1) is not a multiple of Bin_numb. As a result, the last bin has floor(length(Variables)/Bin_numb) + mod(length(Variables),Bin_numb) rows (22+3), and I get this:
bin_sizes =
222 222 222 222 222 222 222 222 222 222 220
length(Variables)-bin_sizes =
23 23 23 23 23 23 23 23 23 23 25
It works.
As for the last row always being the same: it seems to be fine now, but I still have some doubts about bin N-1 and its size.
Last row {1}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {2}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {3}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {4}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {5}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {6}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {7}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {8}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {9}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {10}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {11}: 0.1865 0.9516 0.07304 0.0887 0.697 0.9751 0.5142
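One thing worth noting about the sizes: in the script above, idx(Bin_size(i):Bin_size(i+1)) includes both end points, so neighbouring bins share a boundary row. That is why 23 rows are removed per bin although floor(245/11) is 22. The following is only a sketch of one way to get non-overlapping bins, assuming it is acceptable to spread the leftover rows over several bins instead of putting them all in the last one; the names N, edges and removed are illustrative.

```matlab
Variables = rand(245,7);
Bin_numb = 11;
N = size(Variables,1);

% Hypothetical edge vector: Bin_numb+1 edges, the last one being one past
% the final row. Rounding spreads the remainder rows across the bins.
edges = round(linspace(1, N+1, Bin_numb+1));

Bin_Variables2 = cell(1, Bin_numb);          % preallocate the output
for i = 1:Bin_numb
    idx = true(N,1);
    idx(edges(i):edges(i+1)-1) = false;      % drop bin i only (half-open range)
    Bin_Variables2{i} = Variables(idx,:);
end

% Rows removed per sub-dataset; they add up to N because no row is shared.
removed = N - cellfun(@(x) size(x,1), Bin_Variables2);
disp(removed)
```

For 245 rows and 11 bins each entry of removed is 22 or 23, and every row is left out exactly once.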

Accepted Answer

Manikanta Aditya on 4 Mar 2024
Moved: Dyuman Joshi on 4 Mar 2024
Here is a code snippet I can propose to make the code more efficient by using logical indexing instead of a loop:
Variables = rand(245,57);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
Bin_Variables2 = cell(1, length(Bin_size)-1);
for i = 1:length(Bin_size)-1
    idx = true(size(Variables, 1), 1);
    idx(Bin_size(i):Bin_size(i+1)-1) = false;
    Bin_Variables2{i} = Variables(idx, :);
end
In this code, 'idx' is a logical array that is true for the rows of Variables that you want to keep. This approach avoids the need to concatenate arrays, which can be slow in MATLAB because it involves memory allocation. Instead, you’re just creating a logical index and using it to select the rows you want.
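To make the comparison concrete, here is a small sketch on a toy matrix showing that the logical mask selects the same rows as the concatenation. The matrix A, the row range drop, and the names B_cat and B_mask are illustrative assumptions, not part of the original code.

```matlab
A = magic(6);                 % small stand-in for Variables
drop = 3:4;                   % rows of the bin to leave out

% Concatenation of the rows before and after the bin, as in the question
B_cat = [A(1:drop(1)-1,:); A(drop(end)+1:end,:)];

% Logical mask: true for the rows to keep, false for the bin to drop
idx = true(size(A,1),1);
idx(drop) = false;
B_mask = A(idx,:);

same = isequal(B_cat, B_mask);   % both keep rows 1, 2, 5 and 6
```

The mask version also needs no special case for the first bin, since setting idx(1:2) = false works the same way as for any other range.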
  2 Comments
Dyuman Joshi on 4 Mar 2024
Edited: Dyuman Joshi on 4 Mar 2024
@Manikanta Aditya, this looks good, though I would suggest using size(Bin_size,1) instead of length(Bin_size).
" ... by using logical indexing instead of a loop:"
You are still using a loop.
@Vic, an important part of the code above is Preallocation, which is a good programming practice in MATLAB resulting in improved code performance.
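For reference, length, size and numel answer different questions on a matrix and on a row vector such as Bin_size. A small sketch with the arrays from this thread (the n_* names are illustrative):

```matlab
Variables = rand(245,57);
Bin_size = [1:22:245, 245];          % row vector of edges, as in the code above

n_len   = length(Variables);         % 245, the larger of the two dimensions
n_rows  = size(Variables,1);         % 245, always the number of rows
n_len_t = length(Variables');        % still 245, although the transpose has 57 rows

n_edges     = numel(Bin_size);       % 13, the number of elements
n_edges_len = length(Bin_size);      % 13 as well, since Bin_size is a vector
n_edge_rows = size(Bin_size,1);      % 1, because Bin_size is a row vector
```

So size(Variables,1) is the safe way to count rows when the matrix may have more columns than rows, while numel (or length) gives the number of edges in the row vector Bin_size.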
Manikanta Aditya on 4 Mar 2024
Thanks for the reply, @Dyuman Joshi. My bad, I missed the statement about the loop.
