How does fitlm set reference level with categorical variables?

Question

0 votes

I am running linear regression using fitlm with categorical datasets:

model = fitlm(DataTable ,'Score ~ Industry + Rating + Liquid')

fitlm set the reference levels for Industry and Rating to the values in the first row of the table, but for the "Liquid" variable it set "Q1" as the reference level. I am a little confused by this choice. I thought fitlm would always use the first row as the reference for all three variables. Could you please explain why it chooses a different reference level for the "Liquid" variable?


Answer 1

Cris LaPierre on 11 Oct 2024

1 vote

See this example: Linear Regression with Categorical Predictor

and this note in Algorithms:

fitlm treats a categorical predictor as follows:

A model with a categorical predictor that has L levels (categories) includes L – 1 indicator variables. The model uses the first category as a reference level, so it does not include the indicator variable for the reference level. If the data type of the categorical predictor is categorical, then you can check the order of categories by using categories and reorder the categories by using reordercats to customize the reference level. For more details about creating indicator variables, see Automatic Creation of Dummy Variables.
fitlm treats the group of L – 1 indicator variables as a single variable. If you want to treat the indicator variables as distinct predictor variables, create indicator variables manually by using dummyvar. Then use the indicator variables, except the one corresponding to the reference level of the categorical variable, when you fit a model. For the categorical predictor X, if you specify all columns of dummyvar(X) and an intercept term as predictors, then the design matrix becomes rank deficient.
Interaction terms between a continuous predictor and a categorical predictor with L levels consist of the element-wise product of the L – 1 indicator variables with the continuous predictor.
Interaction terms between two categorical predictors with L and M levels consist of the (L – 1)*(M – 1) indicator variables to include all possible combinations of the two categorical predictor levels.
You cannot specify higher-order terms for a categorical predictor because the square of an indicator is equal to itself.
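
As a rough illustration of the first note above, the sketch below checks the category order with categories and then changes the reference level with reordercats. The table, its values, and the level names Q1 to Q3 are made up for this example and are not the poster's data; fitlm needs Statistics and Machine Learning Toolbox.

```matlab
% Toy data (hypothetical values, not the data set from this thread).
rng(0);                                    % reproducible noise
g = categorical(["Q3";"Q1";"Q2";"Q3";"Q1";"Q2";"Q3";"Q1";"Q2";"Q1"]);
y = double(g) + 0.1*randn(10,1);           % response depends on the level
tbl = table(g, y, 'VariableNames', {'Liquid','Score'});

% categorical() sorts the levels, so Q1 is first even though row 1 is Q3.
categories(tbl.Liquid)

% The first category is the reference level: no Liquid_Q1 coefficient.
mdl1 = fitlm(tbl, 'Score ~ Liquid');
names1 = mdl1.CoefficientNames

% Move Q3 to the front so that it becomes the reference level instead.
tbl.Liquid = reordercats(tbl.Liquid, {'Q3','Q1','Q2'});
mdl2 = fitlm(tbl, 'Score ~ Liquid');
names2 = mdl2.CoefficientNames             % no Liquid_Q3 coefficient
```

Only the order of the categories changes here; the data in the table stay the same, so the two models describe the same fit with a different baseline.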


Guohua on 14 Oct 2024

matlab_datasample2.mat

Please use this data set. The model output I got is below. It looks like fitlm picks the first-row values of Industry and Rating as the reference levels, but for Liquid it selects "1" as the reference. What confuses me is that the model does not select the Industry and Rating reference levels in alphabetical order, which would give "AEROSPACE/DEFENSE" and "AAA", yet it does use numerical order for "Liquid" and picks 1 as the reference. If the solver uses the first row as the reference for Industry and Rating, why doesn't it follow the same rule for "Liquid"? Thank you.

------------------------------------------ Output -------------------------------------------------

md2 = 
Linear regression model:
    Score ~ 1 + Industry + Rating + Liquid

Estimated Coefficients:
                                                  Estimate      SE        tStat         pValue   
                                                  ________    ______    __________    ___________

    (Intercept)                                   -30.661     31.957      -0.95945        0.33751
    Industry_AIRLINES                              74.075     48.623        1.5235        0.12788
    Industry_TECHNOLOGY                             46.33     28.915        1.6023        0.10933
    Industry_RETAIL_&_SUPERMARKETS                 58.743      32.97        1.7817       0.075031
    Industry_OTHER_REITS                           3.1752      50.35      0.063062        0.94973
    Industry_PHARMACEUTICALS                       23.979     39.108       0.61313         0.5399
    Industry_MEDIA_ENTERTAINMENT                  -73.778     33.416       -2.2079       0.027427
    Industry_AUTOMOTIVE_AUTO_SUPPLIERS             26.169     38.719       0.67587        0.49924
    Industry_HEALTHCARE                            27.374     31.002       0.88296        0.37742
    Industry_RETAILERS                            -16.989     69.658      -0.24389        0.80736
    Industry_INDUSTRIAL_OTHER                     -6.4796     38.917       -0.1665        0.86779
    Industry_CONSUMER_PRODUCTS                     23.455     36.496       0.64267        0.52055
    Industry_P&C                                   58.788     32.565        1.8052       0.071269
    Industry_REIT                                   85.54     32.967        2.5947      0.0095745
    Industry_PKGED_FOOD_FOODSVCS_REST              30.374      33.14       0.91654        0.35955
    Industry_CONSUMER_CYCLICAL_SERVICES            10.205     40.119       0.25436        0.79925
    Industry_Utilities_OpCo_FMB                     84.52     34.752        2.4321       0.015146
    Industry_Utilities_Holdco                      86.896     31.922        2.7222      0.0065725
    Industry_Utilities_OpCo_Uns                    71.542     35.941        1.9905       0.046742
    Industry_LIFE                                  68.912     37.124        1.8563       0.063641
    Industry_CONSTRUCTION_MACHINERY                64.929     43.476        1.4934        0.13556
    Industry_AEROSPACE/DEFENSE                     25.573      37.21       0.68726        0.49204
    Industry_AIRCRAFT_LEASE                        26.407     78.714       0.33549        0.73731
    Industry_CHEMICALS                             79.924       33.1        2.4146       0.015888
    Industry_BANKING_US_SUB                        67.288     35.675        1.8862       0.059495
    Industry_BANKING_US_SR                         78.517     33.661        2.3326       0.019822
    Industry_MIDSTREAM                             57.774     31.627        1.8267       0.067968
    Industry_BROKERAGE_ASSETMANAGERS_EXCHANGES     61.685     36.962        1.6689       0.095383
    Industry_CABLE_TELCO                          -15.638     32.765      -0.47727        0.63325
    Industry_INDEPENDENT                           21.359     34.483       0.61941        0.53576
    Industry_OIL_FIELD_SERVICES                   -35.168     36.037      -0.97587        0.32931
    Industry_FINANCE_COMPANIES                     23.547     34.786       0.67691        0.49858
    Industry_BANKING                               39.035     44.918       0.86904        0.38498
    Industry_LIFE_FA_BACKED_NOTES                  70.792     95.979       0.73757        0.46091
    Industry_DIVERSIFIED_MANUFACTURING              37.67     32.393        1.1629        0.24508
    Industry_PACKAGING                             24.327     42.538        0.5719        0.56749
    Industry_TRANSPORTATION_SERVICES                26.77     42.432       0.63089        0.52822
    Industry_ELECTRIC                              79.585     59.145        1.3456        0.17867
    Industry_GAMING                                31.264     39.646       0.78857        0.43051
    Industry_PAPER                                 35.469      48.39       0.73299         0.4637
    Industry_BUILDING_MATERIALS                    17.926     37.124       0.48287        0.62927
    Industry_BEVERAGE                              64.112     52.879        1.2124        0.22557
    Industry_NO_INDUSTRY                          -61.964     71.257      -0.86958        0.38469
    Industry_RAILROADS_ENVIRONMENTAL               42.524     44.653       0.95232        0.34111
    Industry_HOME_CONSTRUCTION                     7.4579     40.197       0.18553        0.85284
    Industry_FINANCIAL_OTHER                       5.8455     63.208      0.092481        0.92633
    Industry_CABLE_SATELLITE                      -1.1275      131.6    -0.0085676        0.99317
    Industry_LODGING_LEISURE                       4.4847     35.959       0.12472        0.90077
    Industry_Utilities_Genco                       20.092     48.759       0.41207        0.68036
    Industry_REFINING                             -14.143     50.302      -0.28117        0.77862
    Industry_REITS_HEALTHCARE                      63.194     50.417        1.2534        0.21027
    Industry_INTEGRATED                            79.255     71.394        1.1101        0.26716
    Industry_HEALTHCARE_REITS                     -597.22     100.19       -5.9606     3.2334e-09
    Industry_RETAIL_REITS                          33.916     94.718       0.35808        0.72034
    Industry_AUTOMOTIVE_AUTO_FINCO                  53.97     69.608       0.77535        0.43828
    Industry_HIGHER_ED_TXCRP                       19.481     131.76       0.14785        0.88248
    Industry_ENVIRONMENTAL                        -58.522     131.98      -0.44342        0.65753
    Industry_BANKING_US_PFD                        32.933     78.824        0.4178        0.67616
    Industry_AIRLINES_EETC_A                       33.805      95.25        0.3549        0.72272
    Industry_INSURANCE_US_SUBORDINATED             53.352     78.697       0.67795        0.49793
    Industry_TOBACCO                               41.967     69.422       0.60451        0.54561
    Industry_BANKING_GLOBAL_TLAC_SR                60.837     131.21       0.46367        0.64296
    Industry_UTILITY_OTHER                         5.1819     131.24      0.039483        0.96851
    Rating_BA2                                     30.788     20.998        1.4662        0.14282
    Rating_AA1                                    -72.611     129.89      -0.55901        0.57625
    Rating_BAA3                                   -6.6247     19.214      -0.34478        0.73032
    Rating_A3                                     -46.588     22.728       -2.0498       0.040584
    Rating_B3                                      78.003     24.971        3.1238      0.0018249
    Rating_AA3                                    -42.946     38.267       -1.1223        0.26196
    Rating_BAA1                                   -30.175     20.288       -1.4874        0.13716
    Rating_B1                                      33.131     21.115         1.569        0.11688
    Rating_BA3                                     14.299     20.243       0.70638        0.48008
    Rating_B2                                      20.609     22.265       0.92559        0.35483
    Rating_A1                                     -71.461     28.924       -2.4707       0.013613
    Rating_A2                                     -54.844     24.399       -2.2478       0.024757
    Rating_BAA2                                   -24.753     19.031       -1.3007         0.1936
    Rating_CAA3                                    534.91     43.266        12.363     2.8422e-33
    Rating_AA2                                    -113.84     52.126       -2.1839       0.029147
    Rating_CAA1                                    169.91     27.988        6.0709      1.667e-09
    Rating_CAA2                                    413.13     39.372        10.493     8.8261e-25
    Rating_CA                                      788.56      57.97        13.603     1.7066e-39
    Rating_NR                                     -98.999     131.24      -0.75431         0.4508
    Rating_AAA                                    -48.279     93.631      -0.51563         0.6062
    Rating_C                                       3773.2     94.223        40.045    5.4623e-229
    Liquid_2                                      -7.5509     10.665      -0.70804        0.47905
    Liquid_3                                       23.177      11.96        1.9378       0.052866
    Liquid_4                                        11.18     12.751       0.87681        0.38075
    Liquid_5                                       20.929     14.331        1.4604        0.14442


Number of observations: 1386, Error degrees of freedom: 1298
Root Mean Squared Error: 128
R-squared: 0.643,  Adjusted R-Squared: 0.619
F-statistic vs. constant model: 26.9, p-value = 7.06e-231

Cris LaPierre on 15 Oct 2024

matlab_datasample2.mat

You are seeing this behavior because Industry and Rating are not categorical variables; only Liquid is.

load matlab_datasample2.mat
varfun(@class,DataSample)
ans = 1x4 table
    class_Score    class_Industry    class_Rating    class_Liquid
    ___________    ______________    ____________    ____________

      double            cell             cell        categorical 

If you want row 1 to be the reference values, then either don't use categorical data types, or use reordercats to ensure the row 1 categorical values are the first category.

Here, I'm converting Liquid to string.

DataSample = convertvars(DataSample, "Liquid","string")
DataSample = 1406x4 table
     Score               Industry                Rating     Liquid
    _______    _____________________________    ________    ______

    -92.102    {'METALS_AND_MINING'        }    {'BA1' }     "2"  
    -125.94    {'AIRLINES'                 }    {'BA2' }     "2"  
    -90.965    {'AIRLINES'                 }    {'BA1' }     "1"  
    -56.942    {'TECHNOLOGY'               }    {'AA1' }     "1"  
    -127.78    {'RETAIL_&_SUPERMARKETS'    }    {'BA1' }     "2"  
     9.7511    {'OTHER_REITS'              }    {'BAA3'}     "4"  
     4.5882    {'PHARMACEUTICALS'          }    {'A3'  }     "1"  
    -112.25    {'MEDIA_ENTERTAINMENT'      }    {'B3'  }     "5"  
    -84.497    {'AUTOMOTIVE_AUTO_SUPPLIERS'}    {'BA2' }     "2"  
    -53.485    {'HEALTHCARE'               }    {'AA3' }     "1"  
    0.51723    {'METALS_AND_MINING'        }    {'BAA1'}     "2"  
    -3.3194    {'AIRLINES'                 }    {'BA1' }     "3"  
    -62.494    {'RETAILERS'                }    {'BA2' }     "5"  
     32.613    {'INDUSTRIAL_OTHER'         }    {'B1'  }     "3"  
     8.5647    {'CONSUMER_PRODUCTS'        }    {'BA3' }     "4"  
    -4.5917    {'P&C'                      }    {'BAA1'}     "2"  
varfun(@class,DataSample)
ans = 1x4 table
    class_Score    class_Industry    class_Rating    class_Liquid
    ___________    ______________    ____________    ____________

      double            cell             cell           string   
model = fitlm(DataSample,'Score ~ Industry + Rating + Liquid')
model = 
Linear regression model:
    Score ~ 1 + Industry + Rating + Liquid

Estimated Coefficients:
                                                  Estimate      SE        tStat         pValue   
                                                  ________    ______    __________    ___________

    (Intercept)                                   -38.212     31.515       -1.2125        0.22554
    Industry_AIRLINES                              74.075     48.623        1.5235        0.12788
    Industry_TECHNOLOGY                             46.33     28.915        1.6023        0.10933
    Industry_RETAIL_&_SUPERMARKETS                 58.743      32.97        1.7817       0.075031
    Industry_OTHER_REITS                           3.1752      50.35      0.063062        0.94973
    Industry_PHARMACEUTICALS                       23.979     39.108       0.61313         0.5399
    Industry_MEDIA_ENTERTAINMENT                  -73.778     33.416       -2.2079       0.027427
    Industry_AUTOMOTIVE_AUTO_SUPPLIERS             26.169     38.719       0.67587        0.49924
    Industry_HEALTHCARE                            27.374     31.002       0.88296        0.37742
    Industry_RETAILERS                            -16.989     69.658      -0.24389        0.80736
    Industry_INDUSTRIAL_OTHER                     -6.4796     38.917       -0.1665        0.86779
    Industry_CONSUMER_PRODUCTS                     23.455     36.496       0.64267        0.52055
    Industry_P&C                                   58.788     32.565        1.8052       0.071269
    Industry_REIT                                   85.54     32.967        2.5947      0.0095745
    Industry_PKGED_FOOD_FOODSVCS_REST              30.374      33.14       0.91654        0.35955
    Industry_CONSUMER_CYCLICAL_SERVICES            10.205     40.119       0.25436        0.79925
    Industry_Utilities_OpCo_FMB                     84.52     34.752        2.4321       0.015146
    Industry_Utilities_Holdco                      86.896     31.922        2.7222      0.0065725
    Industry_Utilities_OpCo_Uns                    71.542     35.941        1.9905       0.046742
    Industry_LIFE                                  68.912     37.124        1.8563       0.063641
    Industry_CONSTRUCTION_MACHINERY                64.929     43.476        1.4934        0.13556
    Industry_AEROSPACE/DEFENSE                     25.573      37.21       0.68726        0.49204
    Industry_AIRCRAFT_LEASE                        26.407     78.714       0.33549        0.73731
    Industry_CHEMICALS                             79.924       33.1        2.4146       0.015888
    Industry_BANKING_US_SUB                        67.288     35.675        1.8862       0.059495
    Industry_BANKING_US_SR                         78.517     33.661        2.3326       0.019822
    Industry_MIDSTREAM                             57.774     31.627        1.8267       0.067968
    Industry_BROKERAGE_ASSETMANAGERS_EXCHANGES     61.685     36.962        1.6689       0.095383
    Industry_CABLE_TELCO                          -15.638     32.765      -0.47727        0.63325
    Industry_INDEPENDENT                           21.359     34.483       0.61941        0.53576
    Industry_OIL_FIELD_SERVICES                   -35.168     36.037      -0.97587        0.32931
    Industry_FINANCE_COMPANIES                     23.547     34.786       0.67691        0.49858
    Industry_BANKING                               39.035     44.918       0.86904        0.38498
    Industry_LIFE_FA_BACKED_NOTES                  70.792     95.979       0.73757        0.46091
    Industry_DIVERSIFIED_MANUFACTURING              37.67     32.393        1.1629        0.24508
    Industry_PACKAGING                             24.327     42.538        0.5719        0.56749
    Industry_TRANSPORTATION_SERVICES                26.77     42.432       0.63089        0.52822
    Industry_ELECTRIC                              79.585     59.145        1.3456        0.17867
    Industry_GAMING                                31.264     39.646       0.78857        0.43051
    Industry_PAPER                                 35.469      48.39       0.73299         0.4637
    Industry_BUILDING_MATERIALS                    17.926     37.124       0.48287        0.62927
    Industry_BEVERAGE                              64.112     52.879        1.2124        0.22557
    Industry_NO_INDUSTRY                          -61.964     71.257      -0.86958        0.38469
    Industry_RAILROADS_ENVIRONMENTAL               42.524     44.653       0.95232        0.34111
    Industry_HOME_CONSTRUCTION                     7.4579     40.197       0.18553        0.85284
    Industry_FINANCIAL_OTHER                       5.8455     63.208      0.092481        0.92633
    Industry_CABLE_SATELLITE                      -1.1275      131.6    -0.0085676        0.99317
    Industry_LODGING_LEISURE                       4.4847     35.959       0.12472        0.90077
    Industry_Utilities_Genco                       20.092     48.759       0.41207        0.68036
    Industry_REFINING                             -14.143     50.302      -0.28117        0.77862
    Industry_REITS_HEALTHCARE                      63.194     50.417        1.2534        0.21027
    Industry_INTEGRATED                            79.255     71.394        1.1101        0.26716
    Industry_HEALTHCARE_REITS                     -597.22     100.19       -5.9606     3.2334e-09
    Industry_RETAIL_REITS                          33.916     94.718       0.35808        0.72034
    Industry_AUTOMOTIVE_AUTO_FINCO                  53.97     69.608       0.77535        0.43828
    Industry_HIGHER_ED_TXCRP                       19.481     131.76       0.14785        0.88248
    Industry_ENVIRONMENTAL                        -58.522     131.98      -0.44342        0.65753
    Industry_BANKING_US_PFD                        32.933     78.824        0.4178        0.67616
    Industry_AIRLINES_EETC_A                       33.805      95.25        0.3549        0.72272
    Industry_INSURANCE_US_SUBORDINATED             53.352     78.697       0.67795        0.49793
    Industry_TOBACCO                               41.967     69.422       0.60451        0.54561
    Industry_BANKING_GLOBAL_TLAC_SR                60.837     131.21       0.46367        0.64296
    Industry_UTILITY_OTHER                         5.1819     131.24      0.039483        0.96851
    Rating_BA2                                     30.788     20.998        1.4662        0.14282
    Rating_AA1                                    -72.611     129.89      -0.55901        0.57625
    Rating_BAA3                                   -6.6247     19.214      -0.34478        0.73032
    Rating_A3                                     -46.588     22.728       -2.0498       0.040584
    Rating_B3                                      78.003     24.971        3.1238      0.0018249
    Rating_AA3                                    -42.946     38.267       -1.1223        0.26196
    Rating_BAA1                                   -30.175     20.288       -1.4874        0.13716
    Rating_B1                                      33.131     21.115         1.569        0.11688
    Rating_BA3                                     14.299     20.243       0.70638        0.48008
    Rating_B2                                      20.609     22.265       0.92559        0.35483
    Rating_A1                                     -71.461     28.924       -2.4707       0.013613
    Rating_A2                                     -54.844     24.399       -2.2478       0.024757
    Rating_BAA2                                   -24.753     19.031       -1.3007         0.1936
    Rating_CAA3                                    534.91     43.266        12.363     2.8422e-33
    Rating_AA2                                    -113.84     52.126       -2.1839       0.029147
    Rating_CAA1                                    169.91     27.988        6.0709      1.667e-09
    Rating_CAA2                                    413.13     39.372        10.493     8.8261e-25
    Rating_CA                                      788.56      57.97        13.603     1.7066e-39
    Rating_NR                                     -98.999     131.24      -0.75431         0.4508
    Rating_AAA                                    -48.279     93.631      -0.51563         0.6062
    Rating_C                                       3773.2     94.223        40.045    5.4623e-229
    Liquid_1                                       7.5509     10.665       0.70804        0.47905
    Liquid_4                                       18.731     11.687        1.6027        0.10925
    Liquid_5                                        28.48     13.186        2.1599       0.030965
    Liquid_3                                       30.728     11.029        2.7862      0.0054111


Number of observations: 1386, Error degrees of freedom: 1298
Root Mean Squared Error: 128
R-squared: 0.643,  Adjusted R-Squared: 0.619
F-statistic vs. constant model: 26.9, p-value = 7.06e-231
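
The other option mentioned above, keeping the variable categorical and using reordercats, could look roughly like the sketch below. It uses a small made-up table rather than matlab_datasample2.mat; the variable names T, T2, T3 and the data values are assumptions for illustration only. The comment about cell arrays reflects the behavior seen in the output in this thread, where the reference level is the first value encountered.

```matlab
% Small hypothetical table; row 1 is not first alphabetically.
Industry = {'TECH';'AIRLINES';'TECH';'BANKING';'AIRLINES';'BANKING';'TECH';'AIRLINES'};
Score    = [1; 2; 1.2; 3; 2.1; 2.9; 0.9; 1.8];
T = table(Score, Industry);

% Cell array of character vectors: levels are taken in order of first
% appearance, so the row 1 value (TECH) is the reference level.
m1 = fitlm(T, 'Score ~ Industry');

% Converted to categorical: categories are sorted, so AIRLINES is the
% reference level regardless of what is in row 1.
T2 = T;
T2.Industry = categorical(T2.Industry);
m2 = fitlm(T2, 'Score ~ Industry');

% Stay categorical, but move the row 1 value to the front of the category
% list so that it becomes the reference level again.
first = char(T2.Industry(1));
rest  = setdiff(categories(T2.Industry), {first}, 'stable');
T3 = T2;
T3.Industry = reordercats(T3.Industry, [{first}; rest]);
m3 = fitlm(T3, 'Score ~ Industry');

% Compare which level is missing from each set of coefficient names.
m1.CoefficientNames
m2.CoefficientNames
m3.CoefficientNames
```

Applying the same reordercats step to Liquid in the real table would make its row 1 value the reference without converting the variable to string.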

Guohua on 15 Oct 2024

Got it, thank you, this is very helpful.
