describe

Describe generated features

Since R2021a

    Description

    describe(Transformer) prints the description of the features generated by Transformer. Create the FeatureTransformer object Transformer by using the gencfeatures or genrfeatures function.

    describe(Transformer,Index) prints the description of the features identified by Index.

    Info = describe(___) returns the feature descriptions in a table, using any of the input argument combinations in the previous syntaxes. The row names of Info correspond to the names of the features.

    Examples

    Generate features from a table of predictor data by using gencfeatures. Inspect the generated features by using the describe object function.

    Read power outage data into the workspace as a table. Remove observations with missing values, and display the first few rows of the table.

    outages = readtable("outages.csv");
    Tbl = rmmissing(outages);
    head(Tbl)
           Region           OutageTime        Loss     Customers     RestorationTime            Cause       
        _____________    ________________    ______    __________    ________________    ___________________
    
        {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
        {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
        {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
        {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
        {'West'     }    2003-06-18 02:49         0             0    2003-06-18 10:54    {'attack'         }
        {'NorthEast'}    2003-07-16 16:23    239.93         49434    2003-07-17 01:12    {'fire'           }
        {'MidWest'  }    2004-09-27 11:09    286.72         66104    2004-09-27 16:37    {'equipment fault'}
        {'SouthEast'}    2004-09-05 17:48    73.387         36073    2004-09-05 20:46    {'equipment fault'}
    

    Some of the variables, such as OutageTime and RestorationTime, have data types that are not supported by classifier training functions like fitcensemble.

    Generate 25 features from the predictors in Tbl that can be used to train a bagged ensemble. Specify the Region table variable as the response.

    Transformer = gencfeatures(Tbl,"Region",25,TargetLearner="bag")
    Transformer = 
      FeatureTransformer with properties:
    
                         Type: 'classification'
                TargetLearner: 'bag'
        NumEngineeredFeatures: 22
          NumOriginalFeatures: 3
             TotalNumFeatures: 25
    
    

    The Transformer object contains the information about the generated features and the transformations used to create them.

    To better understand the generated features, use the describe object function.

    Info = describe(Transformer)
    Info=25×4 table
                                         Type        IsOriginal          InputVariables                                                            Transformations                                                 
                                      ___________    __________    ___________________________    _________________________________________________________________________________________________________________
    
        Loss                          Numeric          true        Loss                           ""                                                                                                               
        Customers                     Numeric          true        Customers                      ""                                                                                                               
        c(Cause)                      Categorical      true        Cause                          "Variable of type categorical converted from a cell data type"                                                   
        RestorationTime-OutageTime    Numeric          false       OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"                                                 
        sdn(OutageTime)               Numeric          false       OutageTime                     "Serial date number from 01-Feb-2002 12:18:00"                                                                   
        woe3(c(Cause))                Numeric          false       Cause                          "Variable of type categorical converted from a cell data type -> Weight of Evidence (positive class = SouthEast)"
        doy(OutageTime)               Numeric          false       OutageTime                     "Day of the year"                                                                                                
        year(OutageTime)              Numeric          false       OutageTime                     "Year"                                                                                                           
        kmd1                          Numeric          false       Loss, Customers                "Euclidean distance to centroid 1 (kmeans clustering with k = 10)"                                               
        kmd5                          Numeric          false       Loss, Customers                "Euclidean distance to centroid 5 (kmeans clustering with k = 10)"                                               
        quarter(OutageTime)           Numeric          false       OutageTime                     "Quarter of the year"                                                                                            
        woe2(c(Cause))                Numeric          false       Cause                          "Variable of type categorical converted from a cell data type -> Weight of Evidence (positive class = NorthEast)"
        year(RestorationTime)         Numeric          false       RestorationTime                "Year"                                                                                                           
        month(OutageTime)             Numeric          false       OutageTime                     "Month of the year"                                                                                              
        Loss.*Customers               Numeric          false       Loss, Customers                "Loss .* Customers"                                                                                              
        tods(OutageTime)              Numeric          false       OutageTime                     "Time of the day in seconds"                                                                                     
          ⋮
    
    

    The Info table indicates the following:

    • The first three generated features are original to Tbl, although the software converts the original Cause variable to a categorical variable c(Cause).

    • The OutageTime and RestorationTime variables are not included as generated features because they are datetime variables, which cannot be used to train a bagged ensemble model. However, the software derives many of the generated features from these variables, such as the fourth feature RestorationTime-OutageTime.

    • Some generated features are a combination of multiple transformations. For example, the software generates the sixth feature woe3(c(Cause)) by converting the Cause variable to a categorical variable and then calculating the Weight of Evidence values for the resulting variable.
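
    The datetime-derived features listed in Info can be approximated by hand. The following is a minimal sketch, not the software's internal implementation: it applies standard MATLAB datetime functions to the first observation shown in the head(Tbl) display, and the variable names (outageTime, elapsedSec, and so on) are illustrative assumptions.

```matlab
% First observation from the head(Tbl) display (values typed in by hand)
outageTime      = datetime(2002,2,1,12,18,0);
restorationTime = datetime(2002,2,7,16,50,0);

% Analogue of RestorationTime-OutageTime: elapsed time in seconds
elapsedSec = seconds(restorationTime - outageTime);

% Analogues of doy(...), year(...), quarter(...), month(...), and tods(...)
doyOutage     = day(outageTime,"dayofyear");      % day of the year
yearOutage    = year(outageTime);                 % year
quarterOutage = quarter(outageTime);              % quarter of the year
monthOutage   = month(outageTime);                % month of the year
todsOutage    = seconds(timeofday(outageTime));   % time of the day in seconds
```

    Each result is a numeric value, which is why features of this kind can be passed to a bagged ensemble even though the original datetime variables cannot.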

    Generate features from a table of predictor data by using genrfeatures. Inspect the generated features by using the describe object function.

    Read power outage data into the workspace as a table. Remove observations with missing values, and display the first few rows of the table.

    outages = readtable("outages.csv");
    Tbl = rmmissing(outages);
    head(Tbl)
           Region           OutageTime        Loss     Customers     RestorationTime            Cause       
        _____________    ________________    ______    __________    ________________    ___________________
    
        {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
        {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
        {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
        {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
        {'West'     }    2003-06-18 02:49         0             0    2003-06-18 10:54    {'attack'         }
        {'NorthEast'}    2003-07-16 16:23    239.93         49434    2003-07-17 01:12    {'fire'           }
        {'MidWest'  }    2004-09-27 11:09    286.72         66104    2004-09-27 16:37    {'equipment fault'}
        {'SouthEast'}    2004-09-05 17:48    73.387         36073    2004-09-05 20:46    {'equipment fault'}
    

    Some of the variables, such as OutageTime and RestorationTime, have data types that are not supported by regression model training functions like fitrensemble.

    Generate 25 features from the predictors in Tbl that can be used to train a bagged ensemble. Specify the Loss table variable as the response.

    rng("default") % For reproducibility
    Transformer = genrfeatures(Tbl,"Loss",25,TargetLearner="bag")
    Transformer = 
      FeatureTransformer with properties:
    
                         Type: 'regression'
                TargetLearner: 'bag'
        NumEngineeredFeatures: 22
          NumOriginalFeatures: 3
             TotalNumFeatures: 25
    
    

    The Transformer object contains the information about the generated features and the transformations used to create them.

    To better understand the generated features, use the describe object function.

    Info = describe(Transformer)
    Info=25×4 table
                                         Type        IsOriginal          InputVariables                                     Transformations                          
                                      ___________    __________    ___________________________    ___________________________________________________________________
    
        c(Region)                     Categorical      true        Region                         "Variable of type categorical converted from a cell data type"     
        Customers                     Numeric          true        Customers                      ""                                                                 
        c(Cause)                      Categorical      true        Cause                          "Variable of type categorical converted from a cell data type"     
        kmd2                          Numeric          false       Customers                      "Euclidean distance to centroid 2 (kmeans clustering with k = 10)" 
        kmd1                          Numeric          false       Customers                      "Euclidean distance to centroid 1 (kmeans clustering with k = 10)" 
        kmd4                          Numeric          false       Customers                      "Euclidean distance to centroid 4 (kmeans clustering with k = 10)" 
        kmd5                          Numeric          false       Customers                      "Euclidean distance to centroid 5 (kmeans clustering with k = 10)" 
        kmd9                          Numeric          false       Customers                      "Euclidean distance to centroid 9 (kmeans clustering with k = 10)" 
        cos(Customers)                Numeric          false       Customers                      "cos( )"                                                           
        RestorationTime-OutageTime    Numeric          false       OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"   
        kmd6                          Numeric          false       Customers                      "Euclidean distance to centroid 6 (kmeans clustering with k = 10)" 
        kmi                           Categorical      false       Customers                      "Cluster index encoding (kmeans clustering with k = 10)"           
        kmd7                          Numeric          false       Customers                      "Euclidean distance to centroid 7 (kmeans clustering with k = 10)" 
        kmd3                          Numeric          false       Customers                      "Euclidean distance to centroid 3 (kmeans clustering with k = 10)" 
        kmd10                         Numeric          false       Customers                      "Euclidean distance to centroid 10 (kmeans clustering with k = 10)"
        hour(RestorationTime)         Numeric          false       RestorationTime                "Hour of the day"                                                  
          ⋮
    
    

    The first three generated features are original to Tbl, although the software converts the original Region and Cause variables to categorical variables.

    Info(1:3,:) % describe(Transformer,1:3)
    ans=3×4 table
                        Type        IsOriginal    InputVariables                           Transformations                        
                     ___________    __________    ______________    ______________________________________________________________
    
        c(Region)    Categorical      true          Region          "Variable of type categorical converted from a cell data type"
        Customers    Numeric          true          Customers       ""                                                            
        c(Cause)     Categorical      true          Cause           "Variable of type categorical converted from a cell data type"
    
    

    The OutageTime and RestorationTime variables are not included as generated features because they are datetime variables, which cannot be used to train a bagged ensemble model. However, the software derives some generated features from these variables, such as the tenth feature RestorationTime-OutageTime.

    Info(10,:) % describe(Transformer,10)
    ans=1×4 table
                                       Type      IsOriginal          InputVariables                                   Transformations                         
                                      _______    __________    ___________________________    ________________________________________________________________
    
        RestorationTime-OutageTime    Numeric      false       OutageTime, RestorationTime    "Elapsed time in seconds between OutageTime and RestorationTime"
    
    

    Some generated features are a combination of multiple transformations. For example, the software generates the nineteenth feature fenc(c(Cause)) by converting the Cause variable to a categorical variable with 10 categories and then calculating the frequency of the categories.

    Info(19,:) % describe(Transformer,19)
    ans=1×4 table
                           Type      IsOriginal    InputVariables                                                  Transformations                                               
                          _______    __________    ______________    ____________________________________________________________________________________________________________
    
        fenc(c(Cause))    Numeric      false           Cause         "Variable of type categorical converted from a cell data type -> Frequency encoding (number of levels = 10)"
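
    The two-step transformation described above (convert to categorical, then frequency encode) can be sketched as follows. This is an illustration under stated assumptions rather than the software's own code: it uses only the eight Cause values shown in the head(Tbl) display, and it assumes that "frequency" means relative frequency. The software might use raw counts or a different normalization.

```matlab
% Small illustrative sample of outage causes (not the full data set)
cause = {'winter storm';'winter storm';'equipment fault';'severe storm'; ...
         'attack';'fire';'equipment fault';'equipment fault'};

% Step 1: convert the cell array to a categorical variable, as in c(Cause)
cCause = categorical(cause);

% Step 2: compute the relative frequency of each category (assumption)
counts = countcats(cCause);          % one count per category
freq   = counts / numel(cCause);

% Step 3: replace each value with the frequency of its category
fencCause = freq(double(cCause));    % numeric column vector
```

    In this sample, every 'equipment fault' observation receives the same numeric value, because the encoding depends only on the category and not on the individual observation.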
    
    

    Input Arguments

    Transformer: Feature transformer, specified as a FeatureTransformer object.

    Index: Features to describe, specified as a numeric or logical vector indicating the position of the features, or a string array or cell array of character vectors indicating the names of the features.

    Example: 1:12

    Data Types: single | double | logical | string | cell
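
    The accepted forms of the feature index can be compared side by side. The following sketch reuses the outages.csv classification example from this page; the output variable names (byPosition, byMask, byName) are illustrative, and the assumption that a logical index has one element per generated feature is inferred from the description above rather than confirmed.

```matlab
% Recreate the Transformer object from the classification example
outages = readtable("outages.csv");
Tbl = rmmissing(outages);
Transformer = gencfeatures(Tbl,"Region",25,TargetLearner="bag");
Info = describe(Transformer);

% Numeric positions of the features
byPosition = describe(Transformer,1:3);

% Logical vector, assumed to have one element per generated feature
% (here, true for the original features)
byMask = describe(Transformer,Info.IsOriginal);

% Feature names, taken from the row names of Info
names = string(Info.Properties.RowNames);
byName = describe(Transformer,names(1:3));
```

    Taking the names from Info.Properties.RowNames avoids hard-coding feature names, which can differ between runs and releases.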

    Output Arguments

    Info: Feature descriptions, returned as a table. Each row corresponds to a generated feature, and each column provides the following information.

    Column Name        Description
    _______________    ___________________________________________________________________________________________________
    Type               Indicates the data type of the feature, either numeric or categorical
    IsOriginal         Indicates whether the feature is an original feature (true) or an engineered feature (false)
    InputVariables     Indicates the original features used to generate the feature
    Transformations    Describes the transformations used to generate the feature, in the order they are applied. For more information, see Feature Transformations.
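
    Because the descriptions are returned as an ordinary table, standard table indexing can be used to filter them. The following is a short sketch based on the regression example from this page. It assumes that IsOriginal is a logical variable and that Transformations holds text, as the displayed output suggests; the names engineered, isKmeans, and kmeansNames are illustrative.

```matlab
% Recreate the Transformer object from the regression example
outages = readtable("outages.csv");
Tbl = rmmissing(outages);
rng("default") % For reproducibility
Transformer = genrfeatures(Tbl,"Loss",25,TargetLearner="bag");
Info = describe(Transformer);

% Keep only the engineered features
engineered = Info(~Info.IsOriginal,:);

% Names of the features whose description mentions k-means clustering
isKmeans = contains(Info.Transformations,"kmeans");
kmeansNames = Info.Properties.RowNames(isKmeans);
```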

    Algorithms

    Feature Transformations

    This table provides additional information on some of the more complex feature transformation descriptions in Info.Transformations.

    Sample Feature Name: eb4(Variable)
    Sample Transformation Description in Info: Equal-width binning (number of bins = 4)
    Additional Information: The software splits the Variable values into 4 bins of equal width. The resulting feature is a categorical variable.

    Sample Feature Name: fenc(Variable)
    Sample Transformation Description in Info: Frequency encoding (number of levels = 10)
    Additional Information: The software calculates the frequency of the 10 categories (or levels) in Variable. In the resulting feature, the software replaces each categorical value with the corresponding category frequency, creating a numeric variable.

    Sample Feature Name: kmc1
    Sample Transformation Description in Info: Centroid encoding (component #1) (kmeans clustering with k = 10)
    Additional Information: The software uses k-means clustering to assign each observation to one of 10 clusters. Each row in the resulting feature corresponds to an observation and is the 1st component of the cluster centroid associated with that observation. The resulting feature is a numeric variable.

    Sample Feature Name: kmd4
    Sample Transformation Description in Info: Euclidean distance to centroid 4 (kmeans clustering with k = 10)
    Additional Information: The software uses k-means clustering to assign each observation to one of 10 clusters. Each row in the resulting feature is the Euclidean distance from the corresponding observation to the centroid of the 4th cluster. The resulting feature is a numeric variable.

    Sample Feature Name: kmi
    Sample Transformation Description in Info: Cluster index encoding (kmeans clustering with k = 10)
    Additional Information: The software uses k-means clustering to assign each observation to one of 10 clusters. Each row in the resulting feature is the cluster index for the corresponding observation. The resulting feature is a categorical variable.

    Sample Feature Name: q50(Variable)
    Sample Transformation Description in Info: Equiprobable binning (number of bins = 50)
    Additional Information: The software splits the Variable values into 50 bins of equal probability. The resulting feature is a categorical variable.

    Sample Feature Name: woe5(Variable)
    Sample Transformation Description in Info: Weight of Evidence (positive class = Class5)
    Additional Information:

    This transformation is available for classification problems only.

    The software performs the following steps to create the resulting feature:

    • Calculate how many total observations have Class5 as a response (a) and how many have a different response (b).

    • Suppose Variable is a nominal categorical variable. Then, for each category in Variable, determine how many observations in that category have Class5 as a response (c) and how many have a different response (d).

      Suppose Variable is an ordinal categorical variable instead. Then, for each category in Variable, find all the observations in that category or a smaller category, and determine how many of those observations have Class5 as a response (c) and how many have a different response (d).

    • For each category, compute the Weight of Evidence (WoE) as

      WoE = ln( ((c + 0.5)/a) / ((d + 0.5)/b) ).

    • Replace each categorical value with the corresponding WoE, creating a numeric variable.
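
    The steps above can be sketched for a nominal variable as follows. This is a minimal illustration with hypothetical data (the category names x, y, z and the label Other are invented for the example) and is not the software's internal implementation.

```matlab
% Hypothetical data: a nominal predictor and a class label
variable = categorical(["x";"x";"y";"y";"y";"z"]);
response = categorical(["Class5";"Other";"Class5";"Class5";"Other";"Other"]);

isPos = (response == "Class5");

% Step 1: total counts with (a) and without (b) the positive class
a = sum(isPos);
b = sum(~isPos);

% Step 2: per-category counts with (c) and without (d) the positive class
cats = categories(variable);
c = zeros(numel(cats),1);
d = zeros(numel(cats),1);
for k = 1:numel(cats)
    inCat = (variable == cats{k});
    c(k) = sum(inCat & isPos);
    d(k) = sum(inCat & ~isPos);
end

% Step 3: Weight of Evidence for each category
woe = log(((c + 0.5)/a) ./ ((d + 0.5)/b));

% Step 4: replace each categorical value with its WoE
woeFeature = woe(double(variable));
```

    For an ordinal variable, the description above implies cumulative counts over each category and all smaller categories, so c and d would be replaced by cumsum(c) and cumsum(d) before Step 3. The 0.5 terms keep the logarithm finite when a category contains no observations of one kind, as with category z here.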

    Version History

    Introduced in R2021a