How can I subsample from a dataset array with multiple position-specific conditions?
Dear all,
I am trying to write a script that sub-samples lists of n items from a large sample of N items. The sub-sampling has to satisfy a set of position-specific constraints, and I am struggling to implement it.
I have a large sample of three-letter words (e.g. [a] [b] [c]). There are 12 possible letters that can occur at the beginning [a] and 6 possible letters that can occur in the middle [b]. The same 12 letters that occur at the beginning [a] also occur at the end [c], but [a] and [c] are never the same within a word.
From my larger sample of 288 words, I want to subsample 8 sub-lists of 36 items that meet the following conditions:
1. Each sub-list needs to be unique (i.e. no replacement)
2. Each letter in position [a] needs to occur 3 times in each list
3. Each letter in position [b] needs to occur 6 times in each list
4. Each letter in position [c] needs to occur 3 times in each list
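To make conditions 2 to 4 concrete, here is a minimal sketch of how one candidate sub-list could be checked. It assumes the letters have been recoded as integers (1 to 12 for [a] and [c], 1 to 6 for [b]); the toy sub-list and all variable names are hypothetical and only there so the check has something to run on.

```matlab
% Hypothetical toy sub-list: 36 rows of integer codes, columns = [a b c].
nA = 12; nB = 6; nC = 12;
[aa, jj] = ndgrid(1:nA, 1:3);        % every a paired with j = 1, 2, 3
a = aa(:);
j = jj(:);
b = mod(a + j, nB) + 1;              % cycles through the 6 middle letters
c = mod(a + j - 1, nC) + 1;          % a shifted by j positions, so c ~= a
sublist = [a b c];

% Count how often each code occurs in each position.
countsA = accumarray(sublist(:, 1), 1, [nA 1]);
countsB = accumarray(sublist(:, 2), 1, [nB 1]);
countsC = accumarray(sublist(:, 3), 1, [nC 1]);

% Conditions 2-4, plus the rule that [a] and [c] differ within a word.
isValid = all(countsA == 3) && all(countsB == 6) && all(countsC == 3) ...
          && all(sublist(:, 1) ~= sublist(:, 3));
```

Because 12 x 3 = 36 and 6 x 6 = 36, a list of 36 items in which no count exceeds its maximum automatically hits every count exactly.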
I'm finding this problem quite difficult to script. I've written the script below with the idea of running a WHILE loop until each counter reaches its maximum. As it is, the script exits the loop as soon as one of the counters reaches its maximum. Instead, whenever an item would push a counter past its maximum, I want to discard that item, select another random item, and keep selecting items until all counters have reached their maximum. Any ideas or suggestions on how to achieve this (perhaps a different approach without the need for counters) would be very much appreciated!
Many thanks for your help!
%% 1. Get data
sample = dataset('file','sample.txt', ...
'Delimiter','\t','ReadObsNames',true);
sample.a = nominal(sample.a);
sample.b = nominal(sample.b);
sample.c = nominal(sample.c);
%% 2. Generate stimuli lists
% number of sublists needed
for l = 1:8
% create an empty dataset array
sublist = dataset([], [], [], 'VarNames', {'a', 'b', 'c'});
% set all counters to zeros
% counters for position a
ctr_a_letter1 = 0;
ctr_a_letter2 = 0;
ctr_a_letter3 = 0;
ctr_a_letter4 = 0;
ctr_a_letter5 = 0;
ctr_a_letter6 = 0;
ctr_a_letter7 = 0;
ctr_a_letter8 = 0;
ctr_a_letter9 = 0;
ctr_a_letter10 = 0;
ctr_a_letter11 = 0;
ctr_a_letter12 = 0;
% counters for position b
ctr_b_letter13 = 0;
ctr_b_letter14 = 0;
ctr_b_letter15 = 0;
ctr_b_letter16 = 0;
ctr_b_letter17 = 0;
ctr_b_letter18 = 0;
% counters for position c
ctr_c_letter1 = 0;
ctr_c_letter2 = 0;
ctr_c_letter3 = 0;
ctr_c_letter4 = 0;
ctr_c_letter5 = 0;
ctr_c_letter6 = 0;
ctr_c_letter7 = 0;
ctr_c_letter8 = 0;
ctr_c_letter9 = 0;
ctr_c_letter10 = 0;
ctr_c_letter11 = 0;
ctr_c_letter12 = 0;
% number of items needed per list
for i = 1:36
% select one item randomly
item = datasample(sample, 1, 'Replace', false);
while ctr_a_letter1 < 3 && ...
ctr_a_letter2 < 3 && ...
ctr_a_letter3 < 3 && ...
ctr_a_letter4 < 3 && ...
ctr_a_letter5 < 3 && ...
ctr_a_letter6 < 3 && ...
ctr_a_letter7 < 3 && ...
ctr_a_letter8 < 3 && ...
ctr_a_letter9 < 3 && ...
ctr_a_letter10 < 3 && ...
ctr_a_letter11 < 3 && ...
ctr_a_letter12 < 3 && ...
ctr_b_letter13 < 6 && ...
ctr_b_letter14 < 6 && ...
ctr_b_letter15 < 6 && ...
ctr_b_letter16 < 6 && ...
ctr_b_letter17 < 6 && ...
ctr_b_letter18 < 6 && ...
ctr_c_letter1 < 3 && ...
ctr_c_letter2 < 3 && ...
ctr_c_letter3 < 3 && ...
ctr_c_letter4 < 3 && ...
ctr_c_letter5 < 3 && ...
ctr_c_letter6 < 3 && ...
ctr_c_letter7 < 3 && ...
ctr_c_letter8 < 3 && ...
ctr_c_letter9 < 3 && ...
ctr_c_letter10 < 3 && ...
ctr_c_letter11 < 3 && ...
ctr_c_letter12 < 3
if item.a == 'letter1'
ctr_a_letter1 = ctr_a_letter1 + 1;
elseif item.a == 'letter2'
ctr_a_letter2 = ctr_a_letter2 + 1;
elseif item.a == 'letter3'
ctr_a_letter3 = ctr_a_letter3 + 1;
elseif item.a == 'letter4'
ctr_a_letter4 = ctr_a_letter4 + 1;
elseif item.a == 'letter5'
ctr_a_letter5 = ctr_a_letter5 + 1;
elseif item.a == 'letter6'
ctr_a_letter6 = ctr_a_letter6 + 1;
elseif item.a == 'letter7'
ctr_a_letter7 = ctr_a_letter7 + 1;
elseif item.a == 'letter8'
ctr_a_letter8 = ctr_a_letter8 + 1;
elseif item.a == 'letter9'
ctr_a_letter9 = ctr_a_letter9 + 1;
elseif item.a == 'letter10'
ctr_a_letter10 = ctr_a_letter10 + 1;
elseif item.a == 'letter11'
ctr_a_letter11 = ctr_a_letter11 + 1;
elseif item.a == 'letter12'
ctr_a_letter12 = ctr_a_letter12 + 1;
end
if item.b == 'letter13'
ctr_b_letter13 = ctr_b_letter13 + 1;
elseif item.b == 'letter14'
ctr_b_letter14 = ctr_b_letter14 + 1;
elseif item.b == 'letter15'
ctr_b_letter15 = ctr_b_letter15 + 1;
elseif item.b == 'letter16'
ctr_b_letter16 = ctr_b_letter16 + 1;
elseif item.b == 'letter17'
ctr_b_letter17 = ctr_b_letter17 + 1;
elseif item.b == 'letter18'
ctr_b_letter18 = ctr_b_letter18 + 1;
end
if item.c == 'letter1'
ctr_c_letter1 = ctr_c_letter1 + 1;
elseif item.c == 'letter2'
ctr_c_letter2 = ctr_c_letter2 + 1;
elseif item.c == 'letter3'
ctr_c_letter3 = ctr_c_letter3 + 1;
elseif item.c == 'letter4'
ctr_c_letter4 = ctr_c_letter4 + 1;
elseif item.c == 'letter5'
ctr_c_letter5 = ctr_c_letter5 + 1;
elseif item.c == 'letter6'
ctr_c_letter6 = ctr_c_letter6 + 1;
elseif item.c == 'letter7'
ctr_c_letter7 = ctr_c_letter7 + 1;
elseif item.c == 'letter8'
ctr_c_letter8 = ctr_c_letter8 + 1;
elseif item.c == 'letter9'
ctr_c_letter9 = ctr_c_letter9 + 1;
elseif item.c == 'letter10'
ctr_c_letter10 = ctr_c_letter10 + 1;
elseif item.c == 'letter11'
ctr_c_letter11 = ctr_c_letter11 + 1;
elseif item.c == 'letter12'
ctr_c_letter12 = ctr_c_letter12 + 1;
end
% append the item to the subset
sublist = [sublist; item];
% remove selected item from the sample
sample(item.Properties.ObsNames,:) = [];
clear item
break
end
end
end
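One way to avoid the thirty separate counters and the long if/elseif chains is to keep one counter vector per position and index it by an integer letter code. The sketch below is only an illustration of that idea, not a full solution: the four-word mini-sample, the variable names, and the small maxima are all made up so that a rejection is visible, and it assumes the letters are available as cell arrays of strings.

```matlab
% Hypothetical mini-sample (assumption: the real letters would come from
% the columns of the dataset array, converted to cell arrays of strings).
aLetters = {'k'; 'p'; 't'; 'k'};
bLetters = {'a'; 'i'; 'a'; 'u'};
cLetters = {'p'; 't'; 'k'; 't'};
n = numel(aLetters);

% Positions a and c share one code table; position b gets its own.
[acLevels, ~, acCode] = unique([aLetters; cLetters]);
[bLevels, ~, bCode]   = unique(bLetters);
aCode = acCode(1:n);
cCode = acCode(n+1:end);

% Toy maxima so that a rejection shows up (the question uses 3, 6 and 3).
maxA = 1; maxB = 2; maxC = 2;

% One counter vector per position instead of one variable per letter.
ctrA = zeros(numel(acLevels), 1);
ctrB = zeros(numel(bLevels), 1);
ctrC = zeros(numel(acLevels), 1);

accepted = false(n, 1);
for k = 1:n
    % Accept the item only if none of its three counters is already full.
    fits = ctrA(aCode(k)) < maxA && ...
           ctrB(bCode(k)) < maxB && ...
           ctrC(cCode(k)) < maxC;
    if fits
        ctrA(aCode(k)) = ctrA(aCode(k)) + 1;
        ctrB(bCode(k)) = ctrB(bCode(k)) + 1;
        ctrC(cCode(k)) = ctrC(cCode(k)) + 1;
        accepted(k) = true;
    end
end
```

Here the fourth word is rejected because its first letter has already been used the maximum number of times, while the first three are accepted. A rejected item is simply skipped, which is the "select another item and keep going" behaviour described in the question.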
1 Comment
Greg
on 25 Oct 2017
I'd have to spend way more time for a complete solution, but I think randperm() might be a good start.
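To illustrate the randperm() suggestion, here is a rough sketch, under stated assumptions, of drawing a single sub-list: shuffle the row indices with randperm, walk through them once, keep every item whose counters still have room, and reshuffle if the pass ends with fewer than 36 items. The 288-row `words` matrix is a synthetic stand-in built so that valid sub-lists exist; it is not the real data, and all names and the attempt limit are hypothetical.

```matlab
rng(1);   % fixed seed so the run is repeatable

% Synthetic stand-in for the real sample: 288 unique rows of [a b c]
% integer codes with c ~= a (hypothetical data, not the original words).
nA = 12; nB = 6; nC = 12;
[aa, tt] = ndgrid(0:nA-1, 0:23);
a = aa(:);
t = tt(:);
c = mod(a + 1 + mod(t, 11), nC);     % offset 1..11, so c never equals a
b = mod(a + floor(t / 11), nB);
words = [a b c] + 1;                 % shift to 1-based codes

maxAttempts = 5000;                  % hypothetical cap on reshuffles
sublistIdx = [];                     % stays empty if no attempt succeeds
for attempt = 1:maxAttempts
    order = randperm(size(words, 1));    % random order, no repeats
    ctrA = zeros(nA, 1);
    ctrB = zeros(nB, 1);
    ctrC = zeros(nC, 1);
    picked = zeros(36, 1);
    nPicked = 0;
    for k = order
        w = words(k, :);
        % Skip any item that would push a counter past its maximum.
        if ctrA(w(1)) < 3 && ctrB(w(2)) < 6 && ctrC(w(3)) < 3
            ctrA(w(1)) = ctrA(w(1)) + 1;
            ctrB(w(2)) = ctrB(w(2)) + 1;
            ctrC(w(3)) = ctrC(w(3)) + 1;
            nPicked = nPicked + 1;
            picked(nPicked) = k;
            if nPicked == 36
                break
            end
        end
    end
    if nPicked == 36
        sublistIdx = picked;         % row indices of one valid sub-list
        break
    end
end
```

This greedy pass is not guaranteed to succeed: it can get stuck with a few slots that no remaining item fits, which is why it reshuffles, and `sublistIdx` is left empty if every attempt fails. It also only produces one sub-list. Splitting all 288 words into 8 such lists with nothing left over is a harder problem, because the last list has to consist of exactly the remaining items, so removing the picked rows and repeating this loop may well fail on the later lists.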
Answers (0)