dummyvar

Create dummy variables

Syntax

D = dummyvar(group)

Description

D = dummyvar(group) returns a matrix D containing zeros and ones, whose columns are dummy variables for the grouping variables in group. Each column of group is a single grouping variable, with values indicating category levels. The rows of group represent observations across all variables.

Examples

Create Dummy Variables from Categorical Grouping Variable

Create a column vector of categorical data specifying color types.

Colors = {'Red';'Blue';'Green';'Red';'Green';'Blue'};
Colors = categorical(Colors);

Create dummy variables for each color type.

D = dummyvar(Colors)

D = 6×3

     0     0     1
     1     0     0
     0     1     0
     0     0     1
     0     1     0
     1     0     0

The columns of D correspond to the levels in Colors, in the order returned by the categories function. For example, the first column of D corresponds to the first level, 'Blue', in Colors.

Display the category levels of Colors.

categories(Colors)

ans = 3x1 cell
    {'Blue' }
    {'Green'}
    {'Red'  }

Create Dummy Variables from Numeric Grouping Variables

Create a matrix named group whose two columns record which of two machines and which of three operators were involved in each run of a process.

machine = [1 1 1 1 2 2 2 2]';
operator = [1 2 3 1 2 3 1 2]';
group = [machine operator]

group = 8×2

     1     1
     1     2
     1     3
     1     1
     2     2
     2     3
     2     1
     2     2

Create dummy variables of the data in group.

D = dummyvar(group)

D = 8×5

     1     0     1     0     0
     1     0     0     1     0
     1     0     0     0     1
     1     0     1     0     0
     0     1     0     1     0
     0     1     0     0     1
     0     1     1     0     0
     0     1     0     1     0

The first two columns of D represent observations of machine 1 and machine 2, respectively. The remaining columns represent observations of the three operators.

Create Dummy Variables from Multiple Grouping Variables

Create a cell array of phone types and a numeric vector of area codes.

phone = {'mobile';'landline';'mobile';'mobile';'mobile';'landline';'landline'};
codes = [802 802 603 603 802 603 802]';

The area code data has only two levels, 603 and 802. If you pass codes to dummyvar as a numeric vector, the function assumes 802 levels corresponding to the integers 1:802. To avoid this, convert codes to a categorical vector.

newcodes = categorical(codes);

Combine the phone and newcodes grouping variables into the cell array group.

group = {phone,newcodes};

Create dummy variables for the groups in group.

D = dummyvar(group)

D = 7×4

     1     0     0     1
     0     1     0     1
     1     0     1     0
     1     0     1     0
     1     0     0     1
     0     1     1     0
     0     1     0     1

The first two columns of D correspond to the phone types, and the last two columns correspond to the area codes.

One-Hot Decode Dummy Variables

Create dummy variables, and then decode them back into the original data.

Create a column vector of categorical data specifying color types.

colorsOriginal = ["red";"blue";"red";"green";"yellow";"blue"];
colorsOriginal = categorical(colorsOriginal)

colorsOriginal = 6x1 categorical
     red 
     blue 
     red 
     green 
     yellow 
     blue

Determine the classes in the categorical vector.

classes = categories(colorsOriginal);

Create dummy variables for each color type by using the dummyvar function.

dummyColors = dummyvar(colorsOriginal)

dummyColors = 6×4

     0     0     1     0
     1     0     0     0
     0     0     1     0
     0     1     0     0
     0     0     0     1
     1     0     0     0

Decode the dummy variables in the second dimension by using the onehotdecode function.

colorsDecoded = onehotdecode(dummyColors,classes,2)

colorsDecoded = 6x1 categorical
     red 
     blue 
     red 
     green 
     yellow 
     blue

The decoded variables match the original color types.

Input Arguments

`group` — Grouping variables
positive integer vector | categorical column vector | cell array | positive integer matrix

Grouping variables, specified as a positive integer vector or categorical column vector representing levels within a single variable, a cell array containing one or more grouping variables, or a positive integer matrix representing levels within multiple variables.

If group is a categorical vector, then the groups and their order match the output of the categories function applied to group. If group is a numeric vector, then dummyvar assumes that the groups and their order are 1:max(group). In this respect, dummyvar treats a numeric grouping variable differently from grp2idx. For information on the order of groups within grouping variables, see Grouping Variables.
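The difference from grp2idx can be illustrated with a small sketch. The variable names (g, Dnum, idx, Dcat) are chosen here for illustration only, and the vector deliberately skips level 2.

```matlab
% Levels 1 and 3 occur in the data; level 2 never does.
g = [1 3 3 1]';

% dummyvar assumes the levels 1:max(g), so it creates three columns,
% and the column for the unobserved level 2 contains only zeros.
Dnum = dummyvar(g);

% grp2idx indexes only the values that actually occur, so it finds two groups.
idx = grp2idx(g);

% Converting to categorical first keeps only the observed levels,
% which gives two dummy variables instead of three.
Dcat = dummyvar(categorical(g));
```

Comparing size(Dnum) with size(Dcat) shows the extra all-zero column that the numeric interpretation introduces.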

Example: [2 1 1 1 2 3 3 2]'

Example: {Origin,Cylinders}

Data Types: single | double | categorical | cell

Output Arguments

`D` — Dummy variables
numeric matrix

Dummy variables, returned as an n-by-s numeric matrix, where n is the number of rows of group and s is the sum of the number of levels in each column of group. From left to right, the columns of D are dummy variables created from the first column of group, followed by dummy variables created from the second column of group, and so on.
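As a quick check of this layout, the following sketch reuses the machine and operator data from the numeric example. The names levelsPerColumn, firstBlock, and secondBlock are illustrative and not part of dummyvar.

```matlab
machine  = [1 1 1 1 2 2 2 2]';
operator = [1 2 3 1 2 3 1 2]';
group = [machine operator];

D = dummyvar(group);
[n,s] = size(D);                      % n rows of group, s dummy variables

% For positive integer data, each column contributes max(column) levels.
levelsPerColumn = max(group,[],1);

% Split D into the block for each grouping variable, left to right.
firstBlock  = D(:,1:levelsPerColumn(1));       % dummy variables for machine
secondBlock = D(:,levelsPerColumn(1)+1:end);   % dummy variables for operator
```

Here s equals sum(levelsPerColumn), and each block has exactly one 1 per row.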

Data Types: single | double

Tips

Use dummy variables in regression analysis and ANOVA to indicate values of categorical predictors.
dummyvar treats NaN values and undefined categorical levels in group as missing data and returns NaN values in D.
If a column of ones is introduced in the matrix D, then the resulting matrix X = [ones(size(D,1),1) D] is rank deficient. If group has multiple columns, then the matrix D itself is rank deficient because dummy variables produced from any column of group always sum to a column of ones. Regression and ANOVA calculations often address this issue by eliminating one dummy variable (implicitly setting the coefficients for dropped columns to zero) from each group of dummy variables produced by a column of group.
If group is a numeric vector with levels that do not correspond exactly to the integers 1:max(group), first convert the data to a categorical vector by using categorical. You can then pass the result to dummyvar. For an example, see Create Dummy Variables from Multiple Grouping Variables.
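The rank deficiency described in these tips can be checked numerically. The following is a minimal sketch with made-up data; dropping the first dummy variable is just one possible choice of reference level, not a rule imposed by dummyvar.

```matlab
g = [1 2 3 1 2 3]';
D = dummyvar(g);                         % 6-by-3, one column per level

% Adding an intercept column makes the design matrix rank deficient,
% because the columns of D already sum to a column of ones.
X = [ones(size(D,1),1) D];               % 6-by-4
rankX = rank(X);                         % smaller than size(X,2)

% Dropping one dummy variable (here, level 1) restores full column rank.
Xreduced = [ones(size(D,1),1) D(:,2:end)];   % 6-by-3
rankReduced = rank(Xreduced);
```

With the reduced matrix, the coefficient of each remaining dummy variable is interpreted relative to the dropped level.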

Alternative Functionality

Alternatively, use onehotencode to encode data labels. Consider using onehotencode instead of dummyvar in these cases:

To encode a table of categorical data labels
To specify the dimension to expand for encoding the data labels
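The two cases above might look like the following sketch. It assumes that onehotencode is available in your installation (it ships separately from dummyvar), and the variable names are illustrative.

```matlab
colors = categorical(["red";"blue";"red";"green"]);

% dummyvar always expands along the second dimension.
D = dummyvar(colors);                 % 4-by-3

% onehotencode takes the dimension to expand as its second argument.
E = onehotencode(colors,2);           % 4-by-3, one column per category
Erows = onehotencode(colors',1);      % 3-by-4, one row per category

% onehotencode also accepts a table that contains the categorical labels.
tbl = table(colors);
Etbl = onehotencode(tbl);

% For a categorical column vector, both functions order the
% encoded variables by categories(colors).
sameEncoding = isequal(D,E);
```

For a plain categorical column vector the two functions are interchangeable; the table input and the choice of dimension are where onehotencode offers more.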

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

This function fully supports tall arrays. For more information, see Tall Arrays.

Version History

Introduced before R2006a
