Main Content

Split Data into Groups and Calculate Statistics

This example shows how to split data from the patients.mat data file into groups. Then it shows how to calculate mean weights and body mass indices, and variances in blood pressure readings, for the groups of patients. It also shows how to summarize the results in a table.

Load Patient Data

Load sample data gathered from 100 patients.

load patients

Convert Gender and SelfAssessedHealthStatus to categorical arrays.

Gender = categorical(Gender);
SelfAssessedHealthStatus = categorical(SelfAssessedHealthStatus);
  Name                            Size            Bytes  Class          Attributes

  Age                           100x1               800  double                   
  Diastolic                     100x1               800  double                   
  Gender                        100x1               330  categorical              
  Height                        100x1               800  double                   
  LastName                      100x1             11616  cell                     
  Location                      100x1             14208  cell                     
  SelfAssessedHealthStatus      100x1               560  categorical              
  Smoker                        100x1               100  logical                  
  Systolic                      100x1               800  double                   
  Weight                        100x1               800  double                   

Calculate Mean Weights

Split the patients into nonsmokers and smokers using the Smoker variable. Calculate the mean weight for each group.

[G,smoker] = findgroups(Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight = 2×1


The findgroups function returns G, a vector of group numbers created from Smoker. The splitapply function uses G to split Weight into two groups. splitapply applies the mean function to each group and concatenates the mean weights into a vector.

findgroups returns a vector of group identifiers as the second output argument. The group identifiers are logical values because Smoker contains logical values. The patients in the first group are nonsmokers, and the patients in the second group are smokers.

smoker = 2x1 logical array


Split the patient weights by both gender and status as a smoker and calculate the mean weights.

G = findgroups(Gender,Smoker);
meanWeight = splitapply(@mean,Weight,G)
meanWeight = 4×1


The unique combinations across Gender and Smoker identify four groups of patients: female nonsmokers, female smokers, male nonsmokers, and male smokers. Summarize the four groups and their mean weights in a table.

[G,gender,smoker] = findgroups(Gender,Smoker);
T = table(gender,smoker,meanWeight)
T=4×3 table
    gender    smoker    meanWeight
    ______    ______    __________

    Female    false       130.32  
    Female    true        130.92  
    Male      false       180.04  
    Male      true        181.14  

T.gender contains categorical values, and T.smoker contains logical values. The data types of these table variables match the data types of Gender and Smoker respectively.

Calculate body mass index (BMI) for the four groups of patients. Define a function that takes Height and Weight as its two input arguments, and that calculates BMI.

meanBMIfcn = @(h,w)mean((w ./ (h.^2)) * 703);
BMI = splitapply(meanBMIfcn,Height,Weight,G)
BMI = 4×1


Group Patients Based on Self-Reports

Calculate the fraction of patients who report their health as either Poor or Fair. First, use splitapply to count the number of patients in each group: female nonsmokers, female smokers, male nonsmokers, and male smokers. Then, count only those patients who report their health as either Poor or Fair, using logical indexing on S and G. From these two sets of counts, calculate the fraction for each group.

[G,gender,smoker] = findgroups(Gender,Smoker);
S = SelfAssessedHealthStatus;
I = ismember(S,{'Poor','Fair'});
numPatients = splitapply(@numel,S,G);
numPF = splitapply(@numel,S(I),G(I));
ans = 4×1


Compare the standard deviation in Diastolic readings of those patients who report Poor or Fair health, and those patients who report Good or Excellent health.

stdDiastolicPF = splitapply(@std,Diastolic(I),G(I));
stdDiastolicGE = splitapply(@std,Diastolic(~I),G(~I));

Collect results in a table. For these patients, the female nonsmokers who report Poor or Fair health show the widest variation in blood pressure readings.

T = table(gender,smoker,numPatients,numPF,stdDiastolicPF,stdDiastolicGE,BMI)
T=4×7 table
    gender    smoker    numPatients    numPF    stdDiastolicPF    stdDiastolicGE     BMI  
    ______    ______    ___________    _____    ______________    ______________    ______

    Female    false         40          10          6.8872            3.9012        21.672
    Female    true          13           5          5.4129            5.0409        21.669
    Male      false         26           8          4.2678            4.8159        26.578
    Male      true          21           3          5.6862             5.258        26.458

See Also


Related Topics