# Fault Detection and Remaining Useful Life Estimation Using Categorical Data

This example shows how to create models for fault classification and remaining useful life (RUL) estimation using categorical machine data. Categorical data is data that has values in a finite set of discrete categories. For machine data, categorical variables can be the manufacturer's code, location of the machine, and experience level of the operators. You can use these variables as predictors, along with other measured sensor data, to help identify which machines will need maintenance.

Here, you use categorical variables to train a binary decision tree model that classifies whether machines are broken. Then you fit a covariate survival model to the data to predict RUL.

### Data Set

The data set [1] contains simulated sensor records of 999 machines made by four different providers with slight variations among their models. The simulated machines were used by three different teams over the simulation timespan. In total, the data set contains seven variables per machine:

- `lifetime` (numeric): Number of weeks the machine has been active
- `broken` (boolean): Machine status, where true indicates a broken machine
- `pressureInd` (numeric): Pressure index. A sudden drop can indicate a leak.
- `moistureInd` (numeric): Moisture index (relative humidity). Excessive humidity can create mold and damage the equipment.
- `temperatureInd` (numeric): Temperature index
- `team` (categorical): Team using the machine, represented as a string
- `provider` (categorical): Machine manufacturer name, represented as a string

The strings in the `team` and `provider` data represent categorical variables that contain nonnumeric data. Typically, categorical variables have the following forms:

- String or character data types: Can be used for nominal categorical variables, where the value does not have any ranking or order
- Integer or enumerated data types: Can be used for ordinal categorical variables, where the values have a natural order or ranking
- Boolean data type: Can be used for categorical variables with only two values

In addition, MATLAB® provides a special data type, `categorical`, that you can use for machine-learning computations. The `categorical` command converts an array of values to a `categorical` array.
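As a minimal sketch of the `categorical` command, the following snippet builds one nominal and one ordinal array. The labels are made up for illustration and are not read from the example data set.

```matlab
% Illustrative labels only (hypothetical, not read from simulatedData)
teamLabels = {'TeamA';'TeamC';'TeamB';'TeamA'};

% Nominal categorical array: categories have no inherent order
team = categorical(teamLabels);
teamNames  = categories(team);   % unique category names, sorted alphabetically
teamCounts = countcats(team);    % number of elements in each category

% Ordinal categorical array: the second argument fixes the category order
experience = categorical({'low';'high';'medium';'low'}, ...
    {'low','medium','high'},'Ordinal',true);
aboveLow = experience > 'low';   % relational operators work on ordinal arrays
```

Because `experience` is ordinal, comparisons such as `experience > 'low'` are meaningful. The same comparison on the nominal `team` array would produce an error.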

Load the data.

`load('simulatedData.mat');`

Plot histograms of the numeric machine variables in the data set to visualize their distributions. The histograms help you assess the distribution of the values and identify outliers or unusual patterns within the data set. These histograms show that the data in `pressureInd`, `moistureInd`, and `temperatureInd` is approximately normally distributed.

```
figure;
tiledlayout(1,3)

nexttile;
histogram(simulatedData.pressureInd);
title('Pressure Index');

nexttile;
histogram(simulatedData.moistureInd);
title('Moisture Index');

nexttile;
histogram(simulatedData.temperatureInd);
title('Temperature Index');
```

Create histograms of the categorical variables. The categorical values are well balanced.

```
figure;
tiledlayout(1,2)

nexttile;
histogram(categorical(simulatedData.team));
title('Team Name');

nexttile;
histogram(categorical(simulatedData.provider));
title('Machine Manufacturer');
```

### Prepare Categorical Variables

To use categorical variables as predictors in machine learning models, convert them to numeric representations. The categorical variables in the data set have a data type of `string`, so first convert them to `categorical` arrays using the `categorical` command.

Once you have converted the strings to `categorical` arrays, you can convert the arrays into a set of binary variables. The software uses a one-hot encoding technique to perform the conversion, with one variable for each category. This format allows the model to treat each category as a separate input. For more information about categorical variables and operations that can be performed on them, see Dummy Variables.
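To make the one-hot layout concrete, here is a small sketch that encodes four made-up team labels with `dummyvar` and then recovers the labels from the encoded matrix. The labels and variable names are illustrative assumptions. `dummyvar` requires Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical labels for illustration (not read from simulatedData)
team = categorical({'TeamA';'TeamC';'TeamB';'TeamA'});

% One column per category, in the order returned by categories(team)
encoded = dummyvar(team);
% encoded =
%      1     0     0
%      0     0     1
%      0     1     0
%      1     0     0

% Each row contains exactly one 1, so the original label can be recovered
[~,catIdx] = max(encoded,[],2);
teamNames  = categories(team);
recovered  = categorical(teamNames(catIdx));
```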

Use the `dummyvar` function to convert the values in the `team` and `provider` variables to a numeric representation. Create a table that includes the original Boolean and numeric variables and the newly encoded variables.

```
opTeam = categorical(simulatedData.team);
opTeamEncoded = dummyvar(opTeam);
operatingTeam = array2table(opTeamEncoded,'VariableNames',categories(opTeam));

providers = categorical(simulatedData.provider);
providersEncoded = dummyvar(providers);
providerNames = array2table(providersEncoded,'VariableNames',categories(providers));

dataTable = [simulatedData(:,{'lifetime','broken','pressureInd','moistureInd','temperatureInd'}), operatingTeam, providerNames];
head(dataTable)
```

```
lifetime    broken    pressureInd    moistureInd    temperatureInd    TeamA    TeamB    TeamC    Provider1    Provider2    Provider3    Provider4
________    ______    ___________    ___________    ______________    _____    _____    _____    _________    _________    _________    _________
56          0         92.179         104.23         96.517            1        0        0        0            0            0            1
81          1         72.076         103.07         87.271            0        0        1        0            0            0            1
60          0         96.272         77.801         112.2             1        0        0        1            0            0            0
86          1         94.406         108.49         72.025            0        0        1        0            1            0            0
34          0         97.753         99.413         103.76            0        1        0        1            0            0            0
30          0         87.679         115.71         89.792            1        0        0        1            0            0            0
68          0         94.614         85.702         142.83            0        1        0        0            1            0            0
65          1         96.483         93.047         98.316            0        1        0        0            0            1            0
```

### Partition Data Set into Training Set and Testing Set

To check whether your model overfits, partition your data so that you train the model on one subset and test it afterwards on another subset that the model has not seen. One common partitioning method is to hold out 20%–30% of the data for testing, leaving 70%–80% for training. You can adjust these percentages based on the specific characteristics of the data set and problem. For more information on other partitioning methods, see What is Cross-Validation?

Here, partition your data set using `cvpartition` with a `Holdout` of 0.20, which reserves 20% of the data for testing.

An alternative approach is to use `KFold` instead of `Holdout` for `cvpartition` during training, and then use `Holdout` to reserve a separate testing data set for validating the model after training.

After partitioning, use the indices that `cvpartition` returns to extract training and testing data sets (`trainData` and `testData`). For reproducibility of the example results, first initialize the random number generator using `rng`.

```
rng('default')
partition = cvpartition(size(dataTable,1),'Holdout',0.20);
trainIndices = training(partition);
testIndices = test(partition);
trainData = dataTable(trainIndices,:);
testData = dataTable(testIndices,:);
```
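The alternative of combining a reserved testing set with k-fold partitions can be sketched as follows. This is one possible arrangement under assumed values: the observation count, the number of folds, and all variable names are hypothetical and are not part of the example.

```matlab
rng('default')                      % reproducible partitions
numObs = 100;                       % hypothetical number of observations

% Step 1: reserve 20% of the observations as a final testing set
holdoutPart = cvpartition(numObs,'Holdout',0.20);
devIdx  = find(training(holdoutPart));   % rows available for training and validation
testIdx = find(test(holdoutPart));       % rows kept aside until the end

% Step 2: split the remaining rows into 5 folds for cross-validation
kfoldPart = cvpartition(numel(devIdx),'KFold',5);
for k = 1:kfoldPart.NumTestSets
    trainRows = devIdx(training(kfoldPart,k));   % rows used to fit the model in fold k
    valRows   = devIdx(test(kfoldPart,k));       % rows used to validate in fold k
    fprintf('Fold %d: %d training rows, %d validation rows\n', ...
        k,numel(trainRows),numel(valRows));
end
```

In this arrangement, the rows in `testIdx` never appear in any fold, so they remain available for a final check of the trained model.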

Extract the predictor and response columns from the data sets. The predictor sets are `Xtrain` and `Xtest`. The response sets are `Ytrain` and `Ytest`. Use `~strcmpi` to exclude the `'broken'` information from the predictor sets.

```
Xtrain = trainData(:,~strcmpi(trainData.Properties.VariableNames, 'broken'));
Ytrain = trainData(:,'broken');
Xtest = testData(:,~strcmpi(testData.Properties.VariableNames, 'broken'));
Ytest = testData(:,'broken');
```

### Train Model

Several functions are available for training a classification model, such as `fitctree`, `fitcsvm`, and `fitcknn`. In this example, use the `fitctree` function to create a binary classification tree from the training data in `Xtrain` and the corresponding responses in `Ytrain`. This model is chosen because of its efficiency and interpretability.

`treeMdl = fitctree(Xtrain,Ytrain);`

Typically, you apply cross-validation to better assess the performance and generalization ability of a model on unseen data. In k-fold cross-validation, the data is first partitioned into k subsets (folds). Then, the model is trained on all but one of the folds, and its performance is evaluated on the remaining fold. This process is repeated until each fold has served once as the validation set, and the results are combined to obtain a more reliable performance estimate.

Create a partitioned model `partitionedModel`. It is common to compute the 5-fold cross-validation misclassification error to strike a balance between variance reduction and computational load. By default, `crossval` ensures that the class proportions in each fold remain approximately the same as the class proportions in the response variable `Ytrain`.

```
partitionedModel = crossval(treeMdl,'KFold',5);
validationAccuracy = 1-kfoldLoss(partitionedModel)
```

validationAccuracy = 0.9675
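To show roughly what `crossval` and `kfoldLoss` do, the following sketch runs the k-fold loop by hand. It is an illustration under assumptions, not the internal implementation: it uses the `fisheriris` sample data set that ships with Statistics and Machine Learning Toolbox rather than the machine data, and the variable names are hypothetical.

```matlab
% Assumption: uses the fisheriris sample data set, not the machine data above
load fisheriris                          % meas (150x4 double), species (150x1 cell)
rng('default')                           % reproducible folds

cvp = cvpartition(species,'KFold',5);    % folds stratified by class label
foldError = zeros(cvp.NumTestSets,1);
for k = 1:cvp.NumTestSets
    trainRows = training(cvp,k);         % logical index of the training folds
    valRows   = test(cvp,k);             % logical index of the validation fold
    foldMdl = fitctree(meas(trainRows,:),species(trainRows));
    foldError(k) = loss(foldMdl,meas(valRows,:),species(valRows));
end

% Average the per-fold errors and convert to an accuracy
manualAccuracy = 1 - mean(foldError);
```

Each observation is used for validation exactly once, so `manualAccuracy` reflects performance on data that the corresponding fold model did not see during training.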

### Testing

Use the `loss` function to evaluate the performance of the decision tree model. This function quantifies the discrepancy between the predicted outputs of the model and the true target values in the data that you pass to it, which here is the testing data. The model error, `mdlError`, that `loss` returns represents the classification error on the testing set, expressed as a fraction. Subtracting `mdlError` from 1 provides the accuracy.

A lower error indicates better model performance.

`mdlError = loss(treeMdl,Xtest,Ytest)`

mdlError = 0.0348

`testAccuracyWithCategoricalVars = 1-mdlError`

testAccuracyWithCategoricalVars = 0.9652
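A single error value does not show which kind of mistake the model makes. As a hedged illustration, the sketch below builds a confusion matrix with `confusionmat` and derives the accuracy from it. The label vectors are made up for the illustration and are not predictions from `treeMdl`.

```matlab
% Hypothetical true and predicted machine states (true = broken)
yTrue = logical([1 0 0 1 0 1 0 0]');
yPred = logical([1 0 0 0 0 1 1 0]');

% Rows are true classes, columns are predicted classes, ordered [false true]
C = confusionmat(yTrue,yPred);
% C =
%      4     1
%      1     2

accuracy  = sum(diag(C))/sum(C(:));   % correct predictions / all predictions
errorRate = 1 - accuracy;             % same relationship as 1 - mdlError above
```

Here the off-diagonal entries separate healthy machines flagged as broken (`C(1,2)`) from broken machines that were missed (`C(2,1)`), which can matter when the two mistakes have different maintenance costs.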

### Impact of Using Categorical Variables

To assess the difference in the performance of the classification model with and without the categorical variables, repeat the previous steps to train another classification decision tree model without using categorical variables as features.

```
Xtrain_ = trainData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});
Ytrain_ = trainData(:,{'broken'});
Xtest_ = testData(:,{'lifetime','pressureInd','moistureInd','temperatureInd'});
Ytest_ = testData(:,{'broken'});
treeMdl_NoCatVars = fitctree(Xtrain_,Ytrain_); % Training
partitionedModel_NoCategorical = crossval(treeMdl_NoCatVars,'KFold',5); % Validation
validationAccuracy_NoCategorical = 1-kfoldLoss(partitionedModel_NoCategorical) % Validation
```

validationAccuracy_NoCategorical = 0.9238

`testAccuracyWithoutCategoricalVars = 1-loss(treeMdl_NoCatVars,Xtest_,Ytest_) %Testing`

testAccuracyWithoutCategoricalVars = 0.9312

The test accuracy drops from over 96% to around 93% when the predictors exclude the categorical data. This result suggests that, in this scenario, including the categorical variables contributes to higher accuracy.
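The effect can be reproduced in an exaggerated form on synthetic data. In the sketch below, which is an illustration under assumptions and not part of the example, the label depends only on a made-up provider category, so a tree that sees the one-hot columns can separate the classes while a tree that sees only the numeric noise cannot.

```matlab
rng('default')                                    % reproducible synthetic data
numObs   = 200;
noise    = randn(numObs,1);                       % uninformative numeric predictor
provider = categorical(randi(4,numObs,1),1:4, ...
    {'P1','P2','P3','P4'});                       % hypothetical provider labels
isBroken = provider == 'P1';                      % label depends only on the category

XwithCat = [noise, dummyvar(provider)];           % numeric column plus one-hot columns
XnoCat   = noise;                                 % numeric column only
trainRows = 1:150;
testRows  = 151:numObs;

mdlWith = fitctree(XwithCat(trainRows,:),isBroken(trainRows));
mdlNo   = fitctree(XnoCat(trainRows,:),isBroken(trainRows));

accWith = 1 - loss(mdlWith,XwithCat(testRows,:),isBroken(testRows));
accNo   = 1 - loss(mdlNo,XnoCat(testRows,:),isBroken(testRows));
```

Real data rarely shows such a clean separation. The size of the gain depends on how much information the categorical variables carry beyond the numeric predictors.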

### Fit Covariate Survival Model to Data

In this section, fit a covariate survival model to the data set to predict the remaining useful life (RUL) of a machine. Covariate survival models are useful when the only data are the failure times and associated covariates for an ensemble of similar components, such as multiple machines manufactured to the same specifications. Covariates are environmental or explanatory variables, such as the component manufacturer or operating conditions. Assuming that the `broken` status of a machine indicates end of life, a `covariateSurvivalModel` estimates the RUL of a component using a proportional hazard survival model. Note that for this case, you can use the nonnumeric `team` and `provider` data directly without performing additional encoding. The model encodes the variables listed in `EncodedVariables` using the method specified by its `EncodingMethod` property, which is set to `'binary'` in this case.
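For orientation, a proportional hazard model in its standard textbook form can be written as follows. The symbols are introduced here for illustration and are not taken from the `covariateSurvivalModel` documentation: $h_0(t)$ is a baseline hazard rate shared by all machines, $\mathbf{x}$ is the vector of (encoded) covariates for one machine, and $\boldsymbol{\beta}$ is the vector of fitted coefficients.

$$
h(t \mid \mathbf{x}) = h_0(t)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
S(t \mid \mathbf{x}) = S_0(t)^{\exp\left(\boldsymbol{\beta}^{\top}\mathbf{x}\right)}
$$

Here $S_0(t)$ is the baseline survival function that corresponds to $h_0(t)$. Under this form, covariates scale the hazard by a constant factor over time, so a positive coefficient shortens the expected life and a negative coefficient lengthens it.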

```
clearvars -except simulatedData
mdl = covariateSurvivalModel('LifeTimeVariable',"lifetime", 'LifeTimeUnit',"days", ...
    'DataVariables',["pressureInd","moistureInd","temperatureInd", "team", "provider"], ...
    'EncodedVariables', ["team", "provider"], "censorVariable", "broken");
mdl.EncodingMethod = 'binary';
```

Split `simulatedData` into fitting data and test data. Define the test data as rows 4 and 5 in the `simulatedData` table.

```
Ctrain = simulatedData;
Ctrain(4:5,:) = [];
Ctest = simulatedData(4:5, ~strcmpi(simulatedData.Properties.VariableNames, 'broken'))
```

`Ctest=`*2×6 table*

```
lifetime    pressureInd    moistureInd    temperatureInd    team         provider
________    ___________    ___________    ______________    _________    _____________
86          94.406         108.49         72.025            {'TeamC'}    {'Provider2'}
34          97.753         99.413         103.76            {'TeamB'}    {'Provider1'}
```

Fit the covariate survival model with the training data.

`fit(mdl, Ctrain)`

Successful convergence: Norm of gradient less than OPTIONS.TolFun

Once the model is fit, verify it against the test data. In the original data set, the machine in row 4 is broken and the machine in row 5 is not broken.

`predictRUL(mdl, Ctest(1,:))`

`ans = `*duration*
-44.405 days

`predictRUL(mdl, Ctest(2,:))`

`ans = `*duration*
10.997 days

The output of the `predictRUL` function is in days for this example, because the model sets `LifeTimeUnit` to `"days"`, and indicates the estimated remaining useful life of each machine. A positive value indicates the estimated number of days to failure, and a negative value indicates that the machine is past its estimated end-of-life time. The estimates are therefore consistent with the known status of both test machines: the broken machine has a negative RUL and the working machine has a positive RUL. Note that the data set used in this example is not very large. Training on a larger data set is likely to make the resulting model more robust and improve prediction accuracy.
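As a small follow-up sketch, the two RUL estimates reported above can be turned into maintenance flags. The 14-day threshold and the variable names are assumptions chosen for illustration and are not part of the example.

```matlab
% RUL estimates reported above for the two test machines (rows 4 and 5)
machineRow   = [4; 5];
estimatedRUL = days([-44.405; 10.997]);

% Hypothetical planning rule: flag machines whose RUL is below a threshold
maintenanceThreshold = days(14);               % assumed value for illustration
pastEndOfLife  = estimatedRUL < days(0);       % negative RUL: past estimated end of life
needsAttention = estimatedRUL < maintenanceThreshold;

summaryTable = table(machineRow,estimatedRUL,pastEndOfLife,needsAttention)
```

With these values, the machine in row 4 is flagged as past its end of life, and both machines fall under the assumed 14-day threshold.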

#### References

[1] Data set created by https://walkercodetutorials.com

## See Also

`categorical` | `dummyvar` | `predictRUL` | `covariateSurvivalModel` | `cvpartition` | `fitctree` | `fitcknn` | `fitcsvm` | `strcmpi`