MATLAB Answers

Dividing grouped data into equally sized subgroups

3 views (last 30 days)
Hello community,
for my current project i need to divide the values of a table for each group (parent-group) into 10 eqaully sized subgroups. The values within a parent-group are arranged hierarchically. By creating the subgroups, the hierarchical order has to remain, so that the first subgroup of a parent-group contains the lowest 10% of the values, the second subgroup of a parent group contains the next larger 10%-stack (>10% - 20%) of the values and so on. In the following, i will give an example:
Data = array2table(rand(1000,1));
Data.Properties.VariableNames{1} = 'Value';
ParentGroups = [1:1:10]';
Data.ParentGroups = repelem(ParentGroups,100);
Data = sortrows(Data,{'ParentGroups','Value'});
My first approach was to create the deciles of each group, which separate each parent-group into 10 equally sized parts. The problem here is, that the function i used simply outputs the values of the deciles, but there is no reference to their location in the original data. For creating the deciles, I used the following code:
decileFnc = @(x) prctile(x,[0:10:100]);
deciles = varfun(decileFnc,Data,'GroupingVariables','ParentGroups','InputVariables',1);
deciles.Fun_Value = string(deciles.Fun_Value);
deciles = [deciles(:,1:2),regexp(deciles.Fun_Value, '\s+', 'split')];
doubleFnc = @(x) double(x);
deciles = [deciles(:,1:2),varfun(doubleFnc,deciles,'InputVariables',3:13)];
The last block of code is used to unstack the output of varfun, i just shared it for the sake of completeness.
I'm really hoping for some help here and appreciate every answer.
  2 Comments
Maximilian Hauschel
Maximilian Hauschel on 5 Aug 2021
Hello Peter,
you are right, i am trying exactly what you've just described. Sorry for not making that clear.
The tag doesn't necessarily need to be the value of the within-group decile, but an index that identifies the corresponding decile (for example, every value that's below the 6th within-group decile should be assigned the number 6)
And thank you for the hint!

Sign in to comment.

Accepted Answer

Maximilian Hauschel
Maximilian Hauschel on 6 Aug 2021
I struggled for a while, but finally figured out a way to solve the problem on my own. In case anyone except me cares, here is my solution:
G = groupcounts(Data,2);
G.GroupCount = double(G.GroupCount);
I used groupconts to identify the number of elements in each parent group
fnc = @(x) prctile(x,10:10:100);
deciles = splitapply(fnc,Data.Value,Data.ParentGroups);
A function to find the deciles within the parent groups, this time I used splitapply
D=[];
Created an empty array that will contain an indicator of the within-group decile for every value
for i=1:10
for s=1:10
Data.Value(height(D)+1:sum(G{1:i,2}),1) <= deciles(i,s);
D(height(D)+1:height(D)+sum(ans),1) = s;
end
end
for every parentgroup i, the loop tests if its values are less than or equal to a decile deciles(i,s) and than assigns the corrosponding indicator s to the array D.
Data(:,3) = array2table(D);
Connecting the values with the indicators .

More Answers (1)

Peter Perkins
Peter Perkins on 6 Aug 2021
>> x = rand(16,1);
>> g = repmat((1:2)',8,1);
>> t = table(g,x)
t =
16×2 table
g x
_ _______
1 0.81472
2 0.90579
1 0.12699
2 0.91338
1 0.63236
2 0.09754
1 0.2785
2 0.54688
1 0.95751
2 0.96489
1 0.15761
2 0.97059
1 0.95717
2 0.48538
1 0.80028
2 0.14189
Here's a function that takes a vector and returns the quartile bin in which each value falls.
function qbin = fun(x)
q = quantile(x,[0 .25 .5 .75 1]);
qbin = discretize(x,q);
end
Apply that function to each group of rows in the table.
>> tq = rowfun(@fun, t, "GroupingVariables","g")
tq =
16×3 table
g GroupCount Var3
_ __________ ____
1 8 3
1 8 1
1 8 2
1 8 2
1 8 4
1 8 1
1 8 4
1 8 3
2 8 3
2 8 3
2 8 1
2 8 2
2 8 4
2 8 4
2 8 2
2 8 1
rowfun has returned its output table in group order. To attach the quartile bin numbers to the correct rows of the original data, use the second output of sort.
>> [~,ord] = sort(t.g);
>> t.qbin(ord) = tq.Var3
t =
16×3 table
g x qbin
_ _______ ____
1 0.81472 3
2 0.90579 3
1 0.12699 1
2 0.91338 3
1 0.63236 2
2 0.09754 1
1 0.2785 2
2 0.54688 2
1 0.95751 4
2 0.96489 4
1 0.15761 1
2 0.97059 4
1 0.95717 4
2 0.48538 2
1 0.80028 3
2 0.14189 1
To make the result easier to confirm, sort by x within group; notice that the quartile bin numbers are as expected.
sortrows(t,["g" "qbin"])
ans =
16×3 table
g x qbin
_ _______ ____
1 0.12699 1
1 0.15761 1
1 0.63236 2
1 0.2785 2
1 0.81472 3
1 0.80028 3
1 0.95751 4
1 0.95717 4
2 0.09754 1
2 0.14189 1
2 0.54688 2
2 0.48538 2
2 0.90579 3
2 0.91338 3
2 0.96489 4
2 0.97059 4
  1 Comment
Maximilian Hauschel
Maximilian Hauschel on 7 Aug 2021
Thank you Peter for sharing your solution. Didn't had the discretize function on my radar. That makes the whole thing way easyer. I'am really happy about the receipt of your answer, because it shows that you also cared about my problem. Additionally I learned something new, so agian, thanks :)

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!