# describe

Describe generated features

## Syntax

`describe(Transformer)`

`describe(Transformer,Index)`

`Info = describe(___)`

## Description

`describe(Transformer)` prints the description of the features generated by `Transformer`. Create the `FeatureTransformer` object `Transformer` by using the `gencfeatures` or `genrfeatures` function.

`describe(Transformer,Index)` prints the description of the features identified by `Index`.

`Info = describe(___)` returns the feature descriptions in a table. Row names of `Info` correspond to the names of the features.

## Examples

Generate features from a table of predictor data by using `gencfeatures`. Inspect the generated features by using the `describe` object function.

Read power outage data into the workspace as a table. Remove observations with missing values, and display the first few rows of the table.

```
outages = readtable("outages.csv");
Tbl = rmmissing(outages);
head(Tbl)
```

```
ans=8×6 table
       Region           OutageTime        Loss     Customers     RestorationTime            Cause       
    _____________    ________________    ______    __________    ________________    ___________________

    {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
    {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
    {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
    {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
    {'West'     }    2003-06-18 02:49         0             0    2003-06-18 10:54    {'attack'         }
    {'NorthEast'}    2003-07-16 16:23    239.93         49434    2003-07-17 01:12    {'fire'           }
    {'MidWest'  }    2004-09-27 11:09    286.72         66104    2004-09-27 16:37    {'equipment fault'}
    {'SouthEast'}    2004-09-05 17:48    73.387         36073    2004-09-05 20:46    {'equipment fault'}
```

Some of the variables, such as `OutageTime` and `RestorationTime`, have data types that are not supported by classifier training functions like `fitcensemble`.

Generate 25 features from the predictors in `Tbl` that can be used to train a bagged ensemble. Specify the `Region` table variable as the response.

```
Transformer = gencfeatures(Tbl,"Region",25,TargetLearner="bag")
```

```
Transformer = 
  FeatureTransformer with properties:

                     Type: 'classification'
            TargetLearner: 'bag'
    NumEngineeredFeatures: 22
      NumOriginalFeatures: 3
         TotalNumFeatures: 25
```

The `Transformer` object contains the information about the generated features and the transformations used to create them.

To better understand the generated features, use the `describe` object function.

```
Info = describe(Transformer)
```

```
Info=25×4 table
                                  Type           IsOriginal    InputVariables                 Transformations
                                  ___________    __________    ___________________________    _________________________________________________________________________________________________________________

    Loss                          Numeric        true          Loss                           ""
    Customers                     Numeric        true          Customers                      ""
    c(Cause)                      Categorical    true          Cause                          "Variable of type categorical converted from a cell data type"
    RestorationTime-OutageTime    Numeric        false         OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"
    sdn(OutageTime)               Numeric        false         OutageTime                     "Serial date number from 01-Feb-2002 12:18:00"
    woe3(c(Cause))                Numeric        false         Cause                          "Variable of type categorical converted from a cell data type -> Weight of Evidence (positive class = SouthEast)"
    doy(OutageTime)               Numeric        false         OutageTime                     "Day of the year"
    year(OutageTime)              Numeric        false         OutageTime                     "Year"
    kmd1                          Numeric        false         Loss, Customers                "Euclidean distance to centroid 1 (kmeans clustering with k = 10)"
    kmd5                          Numeric        false         Loss, Customers                "Euclidean distance to centroid 5 (kmeans clustering with k = 10)"
    quarter(OutageTime)           Numeric        false         OutageTime                     "Quarter of the year"
    woe2(c(Cause))                Numeric        false         Cause                          "Variable of type categorical converted from a cell data type -> Weight of Evidence (positive class = NorthEast)"
    year(RestorationTime)         Numeric        false         RestorationTime                "Year"
    month(OutageTime)             Numeric        false         OutageTime                     "Month of the year"
    Loss.*Customers               Numeric        false         Loss, Customers                "Loss .* Customers"
    tods(OutageTime)              Numeric        false         OutageTime                     "Time of the day in seconds"
      ⋮
```

The `Info` table indicates the following:

• The first three generated features are original to `Tbl`, although the software converts the original `Cause` variable to a categorical variable `c(Cause)`.

• The `OutageTime` and `RestorationTime` variables are not included as generated features because they are `datetime` variables, which cannot be used to train a bagged ensemble model. However, the software derives many of the generated features from these variables, such as the fourth feature `RestorationTime-OutageTime`.

• Some generated features are a combination of multiple transformations. For example, the software generates the sixth feature `woe3(c(Cause))` by converting the `Cause` variable to a categorical variable and then calculating the Weight of Evidence values for the resulting variable.

Generate features from a table of predictor data by using `genrfeatures`. Inspect the generated features by using the `describe` object function.

Read power outage data into the workspace as a table. Remove observations with missing values, and display the first few rows of the table.

```
outages = readtable("outages.csv");
Tbl = rmmissing(outages);
head(Tbl)
```

```
ans=8×6 table
       Region           OutageTime        Loss     Customers     RestorationTime            Cause       
    _____________    ________________    ______    __________    ________________    ___________________

    {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
    {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
    {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
    {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
    {'West'     }    2003-06-18 02:49         0             0    2003-06-18 10:54    {'attack'         }
    {'NorthEast'}    2003-07-16 16:23    239.93         49434    2003-07-17 01:12    {'fire'           }
    {'MidWest'  }    2004-09-27 11:09    286.72         66104    2004-09-27 16:37    {'equipment fault'}
    {'SouthEast'}    2004-09-05 17:48    73.387         36073    2004-09-05 20:46    {'equipment fault'}
```

Some of the variables, such as `OutageTime` and `RestorationTime`, have data types that are not supported by regression model training functions like `fitrensemble`.

Generate 25 features from the predictors in `Tbl` that can be used to train a bagged ensemble. Specify the `Loss` table variable as the response.

```
rng("default") % For reproducibility
Transformer = genrfeatures(Tbl,"Loss",25,TargetLearner="bag")
```

```
Transformer = 
  FeatureTransformer with properties:

                     Type: 'regression'
            TargetLearner: 'bag'
    NumEngineeredFeatures: 22
      NumOriginalFeatures: 3
         TotalNumFeatures: 25
```

The `Transformer` object contains the information about the generated features and the transformations used to create them.

To better understand the generated features, use the `describe` object function.

```
Info = describe(Transformer)
```

```
Info=25×4 table
                                  Type           IsOriginal    InputVariables                 Transformations
                                  ___________    __________    ___________________________    ___________________________________________________________________

    c(Region)                     Categorical    true          Region                         "Variable of type categorical converted from a cell data type"
    Customers                     Numeric        true          Customers                      ""
    c(Cause)                      Categorical    true          Cause                          "Variable of type categorical converted from a cell data type"
    kmd2                          Numeric        false         Customers                      "Euclidean distance to centroid 2 (kmeans clustering with k = 10)"
    kmd1                          Numeric        false         Customers                      "Euclidean distance to centroid 1 (kmeans clustering with k = 10)"
    kmd4                          Numeric        false         Customers                      "Euclidean distance to centroid 4 (kmeans clustering with k = 10)"
    kmd5                          Numeric        false         Customers                      "Euclidean distance to centroid 5 (kmeans clustering with k = 10)"
    kmd9                          Numeric        false         Customers                      "Euclidean distance to centroid 9 (kmeans clustering with k = 10)"
    cos(Customers)                Numeric        false         Customers                      "cos( )"
    RestorationTime-OutageTime    Numeric        false         OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"
    kmd6                          Numeric        false         Customers                      "Euclidean distance to centroid 6 (kmeans clustering with k = 10)"
    kmi                           Categorical    false         Customers                      "Cluster index encoding (kmeans clustering with k = 10)"
    kmd7                          Numeric        false         Customers                      "Euclidean distance to centroid 7 (kmeans clustering with k = 10)"
    kmd3                          Numeric        false         Customers                      "Euclidean distance to centroid 3 (kmeans clustering with k = 10)"
    kmd10                         Numeric        false         Customers                      "Euclidean distance to centroid 10 (kmeans clustering with k = 10)"
    hour(RestorationTime)         Numeric        false         RestorationTime                "Hour of the day"
      ⋮
```

The first three generated features are original to `Tbl`, although the software converts the original `Region` and `Cause` variables to `categorical` variables.

```
Info(1:3,:) % describe(Transformer,1:3)
```

```
ans=3×4 table
                 Type           IsOriginal    InputVariables    Transformations
                 ___________    __________    ______________    ______________________________________________________________

    c(Region)    Categorical    true          Region            "Variable of type categorical converted from a cell data type"
    Customers    Numeric        true          Customers         ""
    c(Cause)     Categorical    true          Cause             "Variable of type categorical converted from a cell data type"
```

The `OutageTime` and `RestorationTime` variables are not included as generated features because they are `datetime` variables, which cannot be used to train a bagged ensemble model. However, the software derives some generated features from these variables, such as the tenth feature `RestorationTime-OutageTime`.

```
Info(10,:) % describe(Transformer,10)
```

```
ans=1×4 table
                                  Type       IsOriginal    InputVariables                 Transformations
                                  _______    __________    ___________________________    ________________________________________________________________

    RestorationTime-OutageTime    Numeric    false         OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"
```

Some generated features are a combination of multiple transformations. For example, the software generates the nineteenth feature `fenc(c(Cause))` by converting the `Cause` variable to a categorical variable with 10 categories and then calculating the frequency of the categories.

```
Info(19,:) % describe(Transformer,19)
```

```
ans=1×4 table
                      Type       IsOriginal    InputVariables    Transformations
                      _______    __________    ______________    ____________________________________________________________________________________________________________

    fenc(c(Cause))    Numeric    false         Cause             "Variable of type categorical converted from a cell data type -> Frequency encoding (number of levels = 10)"
```

## Input Arguments

### `Transformer`

Feature transformer, specified as a `FeatureTransformer` object.

### `Index`

Features to describe, specified as a numeric or logical vector indicating the position of the features, or a string array or cell array of character vectors indicating the names of the features.

Example: `1:12`

Data Types: `single` | `double` | `logical` | `string` | `cell`
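The following sketch illustrates the three forms that `Index` can take: positions, a logical vector, and feature names. It assumes that the `outages.csv` sample file used in the examples is on the path. The variable names `engineeredOnly` and `firstTwoNames` are illustrative, and the logical vector is assumed to need one element per generated feature.

```matlab
% Build a transformer from the same data as the classification example.
Tbl = rmmissing(readtable("outages.csv"));
Transformer = gencfeatures(Tbl,"Region",25,TargetLearner="bag");

% Numeric positions: describe the first three features.
InfoByPosition = describe(Transformer,1:3);

% Logical vector (assumed to have one element per generated feature):
% select only the engineered features.
Info = describe(Transformer);
engineeredOnly = ~Info.IsOriginal;
InfoEngineered = describe(Transformer,engineeredOnly);

% Feature names: a cell array of character vectors taken from the row names.
firstTwoNames = Info.Properties.RowNames(1:2);
InfoByName = describe(Transformer,firstTwoNames);
```

Taking the names from `Info.Properties.RowNames` avoids retyping feature names such as `RestorationTime-OutageTime`, which contain characters that are easy to mistype.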

## Output Arguments

### `Info`

Feature descriptions, returned as a table. Each row corresponds to a generated feature, and each column provides the following information.

| Column Name | Description |
| --- | --- |
| `Type` | Indicates the data type of the feature, either `numeric` or `categorical` |
| `IsOriginal` | Indicates whether the feature is an original feature (`true`) or an engineered feature (`false`) |
| `InputVariables` | Indicates the original features used to generate the feature |
| `Transformations` | Describes the transformations used to generate the feature, in the order they are applied. For more information, see Feature Transformations. |
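Because `Info` is an ordinary table, its columns can be used for filtering and counting. The sketch below is a minimal illustration that assumes the `outages.csv` sample file is on the path. The storage type of the `Type` column is not stated here, so the code converts it with `string` before comparing, which is an assumption made for robustness.

```matlab
Tbl = rmmissing(readtable("outages.csv"));
Transformer = gencfeatures(Tbl,"Region",25,TargetLearner="bag");
Info = describe(Transformer);

% Count original and engineered features from the IsOriginal column.
numOriginal = sum(Info.IsOriginal);
numEngineered = sum(~Info.IsOriginal);

% Keep only the categorical features. Converting to lowercase strings makes
% the comparison independent of how Type is stored and capitalized
% (assumption: Type converts cleanly with string()).
isCategorical = lower(string(Info.Type)) == "categorical";
categoricalFeatures = Info(isCategorical,:);
```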

## Algorithms

### Feature Transformations

This table provides additional information on some of the more complex feature transformation descriptions in `Info.Transformations`.

| Sample Feature Name | Sample Transformation Description in `Info` | Additional Information |
| --- | --- | --- |
| `eb4(Variable)` | `Equal-width binning (number of bins = 4)` | The software splits the `Variable` values into `4` bins of equal width. The resulting feature is a categorical variable. |
| `fenc(Variable)` | `Frequency encoding (number of levels = 10)` | The software calculates the frequency of the `10` categories (or levels) in `Variable`. In the resulting feature, the software replaces each categorical value with the corresponding category frequency, creating a numeric variable. |
| `kmc1` | `Centroid encoding (component #1) (kmeans clustering with k = 10)` | The software uses k-means clustering to assign each observation to one of `10` clusters. Each row in the resulting feature corresponds to an observation and is the `1`st component of the cluster centroid associated with that observation. The resulting feature is a numeric variable. |
| `kmd4` | `Euclidean distance to centroid 4 (kmeans clustering with k = 10)` | The software uses k-means clustering to assign each observation to one of `10` clusters. Each row in the resulting feature is the Euclidean distance from the corresponding observation to the centroid of the `4`th cluster. The resulting feature is a numeric variable. |
| `kmi` | `Cluster index encoding (kmeans clustering with k = 10)` | The software uses k-means clustering to assign each observation to one of `10` clusters. Each row in the resulting feature is the cluster index for the corresponding observation. The resulting feature is a categorical variable. |
| `q50(Variable)` | `Equiprobable binning (number of bins = 50)` | The software splits the `Variable` values into `50` bins of equal probability. The resulting feature is a categorical variable. |
| `woe5(Variable)` | `Weight of Evidence (positive class = Class5)` | This transformation is available for classification problems only. The software creates the resulting feature by following the steps listed after this table. |

To create the `woe5(Variable)` feature, the software performs the following steps:

• Calculate how many total observations have `Class5` as a response (a) and how many have a different response (b).

• Suppose `Variable` is a nominal categorical variable. Then, for each category in `Variable`, determine how many observations in that category have `Class5` as a response (c) and how many have a different response (d).

  Suppose `Variable` is an ordinal categorical variable instead. Then, for each category in `Variable`, find all the observations in that category or a smaller category, and determine how many of those observations have `Class5` as a response (c) and how many have a different response (d).

• For each category, compute the Weight of Evidence (WoE) as

$$\ln\left(\frac{(c+0.5)/a}{(d+0.5)/b}\right).$$

• Replace each categorical value with the corresponding WoE, creating a numeric variable.
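The Weight of Evidence steps for a nominal categorical variable can be sketched in base MATLAB as follows. This is an illustration of the formula on hypothetical toy data, not the software's internal implementation. The names `Variable`, `Response`, `woe`, and `woeFeature` are made up for the example, and the ordinal case (cumulative counts) is not covered.

```matlab
% Hypothetical toy data: a nominal predictor and a class response.
Variable = categorical(["a";"a";"b";"b";"b";"c"]);
Response = categorical(["Class5";"Other";"Class5";"Class5";"Other";"Other"]);

% Step 1: total counts for the positive class (a) and all other classes (b).
isPositive = (Response == "Class5");
a = sum(isPositive);
b = sum(~isPositive);

% Steps 2 and 3: per-category counts (c, d) and the WoE value for each category.
cats = categories(Variable);
woe = zeros(numel(cats),1);
for k = 1:numel(cats)
    inCategory = (Variable == cats{k});
    c = sum(inCategory & isPositive);
    d = sum(inCategory & ~isPositive);
    woe(k) = log(((c + 0.5)/a) / ((d + 0.5)/b));
end

% Step 4: replace each categorical value with its WoE. double(Variable)
% returns the category index of each observation, in the order of cats.
woeFeature = woe(double(Variable));
```

With this toy data, `a` and `b` are both 3, so category `a` (one positive, one negative observation) gets a WoE of 0, category `b` gets `log(5/3)`, and category `c` gets `log(1/3)`. The 0.5 terms keep the logarithm finite for category `c`, which has no positive observations.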