Need advice for coding dummyvar vectors - Regression

How should I properly code dummy-variable vectors for use in regression analysis in MATLAB? I have attached a sample table of data (.xlsx file) that I wish to import into MATLAB and then run regressions on (the outcome is the last column in the table). Some of the categoricals, such as WindAirport and WindRail, are simply coded as 1 or 0 (as 'double' variable types); others are logicals (T/F). For my model output I need to show both groups, such as Site_0 and Site_1, with their regression slope coefficients, as well as the model intercept term and its coefficient. Should all dummy variables be categorical, logical, or double to achieve the desired model output? I will use fitglm as the model function. Any advice is welcome. Thank you.
T = readtable('chels_sample.xlsx'); % readtable keeps the variable names; xlsread would return a plain numeric array instead
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')

Answers (1)

the cyclist on 14 Jun 2023
Edited: the cyclist on 14 Jun 2023
The documentation for fitglm states:
"If data is in a table or dataset array tbl, then, by default, fitglm treats all categorical values, logical values, character arrays, string arrays, and cell arrays of character vectors as categorical variables."
It looks like Day_0 and Day_1 were read in as logical:
T=readtable('chels_sample.xlsx');
class(T.Day_0)
ans = 'logical'
class(T.Day_1)
ans = 'logical'
but that WindAirport and WindRail were not:
class(T.WindAirport)
ans = 'double'
class(T.WindRail)
ans = 'double'
therefore I would explicitly convert those
T.WindAirport = categorical(T.WindAirport);
T.WindRail = categorical(T.WindRail);
before calling the model
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')
mdl =
Generalized linear regression model:
    lnUFP ~ 1 + Day_0 + Day_1 + WindAirport + WindRail
    Distribution = Normal

Estimated Coefficients:
                     Estimate       SE         tStat       pValue
                     ________    ________    _______    ___________
    (Intercept)       9.2949     0.072015     129.07    2.0029e-109
    WindAirport_1     0.7141      0.35646     2.0033       0.047963
    WindRail_1       -0.4161      0.10031    -4.1483     7.2379e-05

99 observations, 96 error degrees of freedom
Estimated Dispersion: 0.244
F-statistic vs. constant model: 12.1, p-value = 2.11e-05
The WindAirport_1 coefficient is the estimated change in the response when WindAirport takes the (categorical) value 1, relative to the reference level, WindAirport = 0.
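If many columns are coded as 0/1 doubles, converting them one by one gets tedious. Here is a minimal sketch of doing it in a loop. The small table is a made-up stand-in for the imported data (only the column names come from the question), and the rule "all values are 0 or 1" is my assumption about which columns are meant to be dummy variables.

```matlab
% Hypothetical stand-in for the imported table; values are invented.
T = table([0;1;0;1;1], [1;0;0;1;0], [9.1;9.8;9.3;9.0;9.9], ...
    'VariableNames', {'WindAirport','WindRail','lnUFP'});

% Consider every variable except the response as a candidate predictor.
predictors = setdiff(T.Properties.VariableNames, {'lnUFP'}, 'stable');

for k = 1:numel(predictors)
    name = predictors{k};
    col  = T.(name);
    % Assumption: a numeric column holding only 0s and 1s is a dummy variable.
    if isnumeric(col) && all(ismember(col, [0 1]))
        T.(name) = categorical(col);   % category names become '0' and '1'
    end
end

class(T.WindAirport)        % class after conversion
categories(T.WindAirport)   % list of category names
```

fitglm also accepts a 'CategoricalVars' name-value argument, which should let you leave the table as is and name the categorical predictors in the call instead.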
  3 Comments
the cyclist on 15 Jun 2023
The overall model intercept term is in the output: Intercept = 9.2949. The intercept is the value of the response when
  • all categorical explanatory variables are at their reference level, and
  • all continuous explanatory variables are zero
I notice that Day_0 and Day_1 are constant in your data, which I expect is why there are no estimated coefficients for them. (Perhaps you only uploaded a subset of the data?) If they are constant, they should not be in the model. The same seems to be true for Site_0 and Site_1, and many of your other variables. So, I don't understand that.
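One way to check for constant columns before fitting is sketched below on a made-up table (the values are invented; only the column names echo the question). It flags any variable that has a single distinct value.

```matlab
% Hypothetical table: Day_0 is constant, WindRail varies.
T = table(true(4,1), [0;1;1;0], [9.1;8.7;8.9;9.4], ...
    'VariableNames', {'Day_0','WindRail','lnUFP'});

% A column is treated as constant when it has exactly one distinct value.
isConstant = varfun(@(x) numel(unique(x)) == 1, T, 'OutputFormat', 'uniform');

% Names of the constant variables, which carry no information for the model.
constantVars = T.Properties.VariableNames(isConstant)
```

Any variable listed in constantVars can be dropped from the model formula.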
For the categorical variables that do have different values (e.g. WindRail), the estimate reported is the change in response for the different levels (e.g. WindRail=1), relative to the reference level (WindRail=0). I would not call that a slope, which would only be calculated for a continuous variable.
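To make the reference-level interpretation concrete, here is a small sketch on synthetic data (the numbers are invented for illustration and are not from the attached file). With a single binary categorical predictor, the intercept should match the mean response at the reference level, and the reported coefficient should match the difference between the two group means.

```matlab
% Synthetic data, made up for illustration only.
WindRail = categorical([0;0;0;1;1;1]);
lnUFP    = [9.2; 9.4; 9.3; 8.8; 9.0; 8.9];
T = table(WindRail, lnUFP);

mdl = fitglm(T, 'lnUFP ~ 1 + WindRail', 'Distribution', 'normal');

% Group means computed directly from the data.
meanRef   = mean(lnUFP(WindRail == '0'));   % reference level (WindRail = 0)
meanOther = mean(lnUFP(WindRail == '1'));   % other level (WindRail = 1)

% Fitted terms, in the order they appear in the coefficient table.
intercept = mdl.Coefficients.Estimate(1);   % (Intercept)
effect    = mdl.Coefficients.Estimate(2);   % WindRail_1

% Both differences should be zero up to numerical precision.
disp([intercept - meanRef, effect - (meanOther - meanRef)])
```

This is also why no separate WindRail_0 coefficient appears in the output: the reference level is absorbed into the intercept.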


Release

R2022a
