Retain dummy variable labels from converting categorical to dummyvar
Hi there,
I have 19 categorical columns, each of which I have converted to numeric codes (one number per category). I now want to expand these so that every category gets its own dummy column. The problem is that I can't tell which dummy column corresponds to which category, and I need that to make the solution interpretable, e.g. to say that whether or not a user is from Thailand is significant in a logistic regression.
Here is my code:
% categoricalnbs is the numeric-code version of all the categorical
% variables. Some columns in that table have categories 1-200, some just
% have categories 1 to 20.
categoricalnbsarray = table2array(categoricalnbs);
% categoricalnbsarray = table2array(finalnbs(:,[9:26,28]));
% finalnbs keeps the actual category names, which I thought could help with
% generating the column labels for the dummyvars, but using that line
% doesn't help.
[~, ~, ugroupA] = unique(categoricalnbsarray(:,2));
dummyvars=dummyvar(ugroupA);
array2table(dummyvars);
This increases the columns in categoricalnbs from 19 to 200, and retains the same number of rows. But how do I interpret the output...

1 Comment
Alessandro Roux
on 29 Dec 2015
Hi Dhruv,
I'm curious what output you expect to receive from "dummyvars".
If I've understood correctly, you have an array of values where each value is the numerical representation of one of twenty categories.
You call "unique" on this array of values and request three outputs from "unique". The first output, which you ignore, contains the unique values of "categoricalnbs" in sorted order (if there are 20 categories numbered 1-20, I'd expect this to be an array from 1 to 20).
The second output, which you also ignore, contains the indices into "categoricalnbs" that return each unique value in sorted order (i.e. indexing with them reproduces the first output of your "unique" call).
The final output, which you call "ugroupA", contains, for every element, the index of its value within the first output. Applying "ugroupA" as an index to the first output therefore reconstructs all of the values of "categoricalnbs" as a single column.
In your code, you are looking for the dummy variables of an index reconstruction of "categoricalnbs".
If I have understood correctly, there will be as many dummy variables as there are unique categories in "categoricalnbs". So, if column 2 of "categoricalnbs" contains 200 possible categories, then you would expect a "dummyvars" output with 200 columns (i.e. 200 dummy variables).
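As a small illustration of the three outputs described above, here is a sketch on made-up codes. The vector "codes" is hypothetical and only stands in for one column of "categoricalnbsarray"; dummyvar requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical category codes standing in for one column of categoricalnbsarray
codes = [3; 1; 3; 7; 1];

% Request all three outputs of unique
[C, ia, ic] = unique(codes);

% C  : sorted unique values, so codes(ia) reproduces C
% ic : for each row, the position of its value in C, so C(ic) reproduces codes
assert(isequal(codes(ia), C));
assert(isequal(C(ic), codes));

% One dummy column per unique value; column j is the indicator for C(j)
D = dummyvar(ic);
disp(size(D))   % 5 rows, 3 columns
```

In this example, column 3 of D marks the rows where the code is 7, because 7 is the third entry of C.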
Do you mind clarifying what, in particular, confuses you about the output that you are receiving?
Alessandro
Accepted Answer
More Answers (2)
I think you can make sense of this by following the documentation: When group is a numeric vector, dummyvar assumes that the groups and their order are 1:max(group). In other words, column order corresponds to the order of the levels. For nominal arrays, the default order is ascending alphabetical.
So, the first dummy variable will be the category with the lowest value, the second will be the second lowest value, etc. You can check this to be sure by comparing the dummy value with the category.
An easy way to make your data more intelligible is to sort by the category before creating the dummies. Then, it should have a nice pattern to it which will be easy to understand.
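The column-order rule can be checked on a small example. This is only a sketch: the codes and the name list are made up, and it assumes you have a list in which names{k} is the label for code k (for instance, built from the table that still holds the category names).

```matlab
% Hypothetical numeric codes and the names they stand for (not from the original post)
codes = [3; 1; 3; 2; 1];
names = {'Brazil', 'India', 'Thailand'};   % names{k} is the label for code k

% For a numeric vector, dummyvar orders its columns as 1:max(codes)
D = dummyvar(codes);

% Check: column k is 1 exactly where codes equals k
for k = 1:max(codes)
    assert(isequal(D(:, k), double(codes == k)));
end

% Carry the names over as table variable names so each dummy is labelled
T = array2table(D, 'VariableNames', names);
```

After this, T.Thailand is the indicator for code 3. Labels that are not valid MATLAB identifiers (for example, names containing spaces) may need to be cleaned first, e.g. with matlab.lang.makeValidName.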
3 Comments
Dhruv Ghulati
on 30 Dec 2015
Dhruv Ghulati
on 30 Dec 2015
Image Analyst
on 30 Dec 2015
Did you see my answer about ordinal()?
Image Analyst
on 30 Dec 2015
There is a function to do this, if I understand you correctly, in the Statistics and Machine Learning Toolbox. It's called ordinal().
1 Comment
Sean de Wolski
on 30 Dec 2015
That's the old way; its use is now discouraged in favor of categorical arrays, which are part of base MATLAB.
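A minimal sketch of the categorical route, assuming a release in which dummyvar accepts a categorical array directly. The country data is made up for illustration.

```matlab
% Hypothetical data, not taken from the original post
country = categorical({'Thailand'; 'India'; 'Thailand'; 'Brazil'; 'India'});

% categories returns the category names in the order dummyvar uses for its columns
labels = categories(country);

% One indicator column per category
D = dummyvar(country);

% Attach the category names as table variable names
T = array2table(D, 'VariableNames', labels');
```

Because the labels come from the same categorical array that produced the dummies, the names and the columns cannot drift out of step.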