MATLAB Answers

Why does categories(TableA.Var2) have more elements than the rows in TableA?

Hannes Truter on 16 Jan 2020
Edited: Hannes Truter on 17 Jan 2020
I do not understand the behaviour I'm seeing when I use categories(TableA.Var2) with a table. I'm getting more categories than there are rows in the table, and I'm also seeing values in the categories list that do not appear in the table column I'm looking at.
I read a table from an Excel file and then want to create a list of the unique Fault Labels. When I use unique(ModuleFC.FaultLabel) I get the list of unique Fault Labels I expect. What confuses me is that FaultList = categories(ModuleFC.FaultLabel) gives me more elements than the number of rows in the table.
function FaultCodeGroups
%% Import data from spreadsheet
% Script for importing data from the following spreadsheet:
% Workbook: C:\FaultCodesTable.xlsx
% Worksheet: FaultCodesTable
% To extend the code for use with different selected data or a different
% spreadsheet, generate a function instead of a script.
% Auto-generated by MATLAB on 2020/01/16 12:14:02
%% Import the data
[~, ~, raw] = xlsread('C:\FaultCodes.xlsx','FaultCodes');
raw(cellfun(@(x) ~isempty(x) && isnumeric(x) && isnan(x),raw)) = {''};
stringVectors = string(raw(:,[1,2,3]));
stringVectors(ismissing(stringVectors)) = '';
%% Replace non-numeric cells with NaN
R = cellfun(@(x) ~isnumeric(x) && ~islogical(x),raw); % Find non-numeric cells
raw(R) = {NaN}; % Replace non-numeric cells
%% Create table
FaultCodesTable = table;
%% Allocate imported array to column variable names
FaultCodesTable.Model = categorical(stringVectors(:,1));
FaultCodesTable.FaultLabel = categorical(stringVectors(:,2));
FaultCodesTable.Module = categorical(stringVectors(:,3));
%% Clear temporary variables
clearvars data raw stringVectors R;
% Get a list of all the different car models
CarList = categories(FaultCodesTable.Model);
% Get a list of all the modules
ModuleList = categories(FaultCodesTable.Module);
%% Extract the fault codes for each car line
for CLCount = 1:numel(CarList)
% Extract table with only the fault codes of the currently selected car line
CarLineFaultCodes = FaultCodesTable(FaultCodesTable.Model == CarList{CLCount},:);
% Delete FaultCodesTable to ensure its FaultLabel column does not exist in memory
% This is only for debugging. Cannot use this in the actual for loop
FaultCodesTable = [];
%% Extract the fault codes for each module
for MCount = 1:numel(ModuleList)
% Extract the fault codes for the currently selected module
ModuleFC = CarLineFaultCodes(CarLineFaultCodes.Module == ModuleList{MCount},:);
% Delete CarLineFaultCodes to ensure its FaultLabel column does not exist
% This is only for debugging. Cannot use this in the actual for loop
CarLineFaultCodes = [];
% ModuleFC has 996 rows
% Get a list of the unique Fault Labels in ModuleFC
UniqueFaults = unique(ModuleFC.FaultLabel);
numel(UniqueFaults) % 59 unique Fault Labels
% Get categories in FaultLabel column of ModuleFC
FaultList = categories(ModuleFC.FaultLabel);
UniqueFL = unique(FaultList);
numel(UniqueFL) % 1127, but ModuleFC has only 996 rows ???



Accepted Answer

Steven Lord on 16 Jan 2020
Your CarLineFaultCodes table only contains a subset of the rows of FaultCodesTable, and your ModuleFC table only contains a subset of the rows of CarLineFaultCodes. But CarLineFaultCodes and ModuleFC were created from FaultCodesTable via indexing. Therefore the FaultLabel variables in those tables were created from the FaultLabel variable in the original table, and each can take as a value any of the categories that the original FaultLabel variable could take.
Indexing into a categorical variable to create a new variable with a subset of the entries doesn't trim the list of categories the new variable can take to just those actually present in that subset. Doing so would be inefficient if the subset was large (we'd need to compute the unique set of categories present). It could also cause problems if you wanted to concatenate that subset with a different subset that contained a category that had been trimmed (especially if the original categorical array is protected, as reported by isprotected).
If you really do want to trim the list of categories, you can use removecats to do so. But you will need to do so explicitly; MATLAB will not do it for you automatically.
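As a minimal sketch of that behaviour (the fault labels and module names below are made-up illustration data, not values from the original spreadsheet), row indexing keeps every category defined in the parent variable, and removecats with a single input drops the ones that no element uses:

```matlab
% Hypothetical data for illustration only
labels = categorical({'P0100'; 'P0200'; 'P0300'; 'P0100'});
module = categorical({'Engine'; 'Engine'; 'Brakes'; 'Engine'});
T = table(labels, module, 'VariableNames', {'FaultLabel', 'Module'});

% Row indexing keeps the full category list of the original variable
sub = T(T.Module == 'Brakes', :);
nAll  = numel(categories(sub.FaultLabel));   % 3: every category is still defined
nUsed = numel(unique(sub.FaultLabel));       % 1: only one value actually occurs

% removecats with one input removes the categories that no element uses
sub.FaultLabel = removecats(sub.FaultLabel);
nTrimmed = numel(categories(sub.FaultLabel)); % 1
```

In the question's code, the equivalent step would presumably be to call removecats on ModuleFC.FaultLabel before calling categories, or simply to keep using unique.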


Eric Sofen on 16 Jan 2020
Beyond the performance issues that Steve pointed out, conceptually a categorical should carry around the complete list of categories even when you take a subset of the values. "US States" is a set of data that should be stored in a categorical. There are 50 categories (Alabama, Alaska, ...), but not all 50 may be represented in your data. By "remembering" all 50 categories in the categorical array, you can then ask questions like "Which states are not present in my data?" using countcats or histogram.
There are of course other situations where categorical may be useful even when you don't know the complete list of possible categories a priori. That's fine, too.
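A small sketch of the "which categories are missing?" question, assuming a shortened, hypothetical list of three states rather than all 50:

```matlab
% Hypothetical value set (three states only) and observations
allStates = {'Alabama', 'Alaska', 'Arizona'};
obs = categorical({'Alaska'; 'Alaska'; 'Arizona'}, allStates);

% countcats returns one count per defined category, in category order
counts = countcats(obs);

% Categories that are defined but never occur in the data
cats = categories(obs);
missingStates = cats(counts == 0);
```

Here counts is [0; 2; 1] and missingStates contains only 'Alabama', which could not be recovered if the category list had been trimmed to the values present.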
Peter Perkins on 16 Jan 2020
Just to expand on what Steve said:
The main purpose of categorical (over, say, an array of strings) is to maintain a list of all the possible values that your data could take on. Just because your current data set only contains things that happened on Mon, Tue, and Fri doesn't mean that you stopped caring about Wed and Thu; it just means that the data you have in hand happens to not have anything on Wed and Thu. So if you want to compute, say, the total sales by day of week, it would probably be useful to know that the totals for Wed and Thu are 0.
Sometimes not, but mostly you want to hang on to knowledge of the possible values even if your current data don't happen to have any instances of them.
It's easy to drop "unused" categories: just call removecats. The mirror image is that when you are reading data from a file and converting to categorical, it is often a good idea to specify all the possible categories, in case your data don't hit all the possibilities.
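One way to sketch the day-of-week example, using made-up sales figures and passing the full value set as the second input to categorical so that Wed and Thu exist as categories even though no row uses them:

```matlab
% Hypothetical data: sales recorded only on Mon, Tue, and Fri
days  = {'Mon', 'Tue', 'Wed', 'Thu', 'Fri'};
day   = categorical({'Mon'; 'Tue'; 'Fri'; 'Mon'}, days);
sales = [10; 20; 5; 7];

% double(day) gives the category codes (1 = Mon, ..., 5 = Fri), so
% accumarray yields one total per defined category, with zeros for
% the days that have no data
totals = accumarray(double(day), sales, [numel(days) 1]);
```

This gives totals of [17; 20; 0; 0; 5]. The sketch assumes every value is in the value set; a value outside it would become undefined and would need handling before the accumarray call.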
Hannes Truter on 17 Jan 2020
Thank you very much Eric and Peter for taking time to explain categorical in even more detail and giving examples. I definitely had the wrong idea of how categorical worked and how it should be used correctly. With your explanations I should be able to modify my code to get out the data I want.
I am very impressed by the MATLAB support by staff members. It is very rare nowadays to see software companies providing real support. Many rely on user forums to do it for them.
