Dummy variable coding in mixed models (LME)

72 views (last 30 days)
Paul
Paul on 10 Dec 2018
Commented: Paul on 16 Feb 2024
Hi all,
I've been a little perplexed by the different ways to code dummy variables when fitting a linear mixed model (using fitlme). Specifically I have a model with two categorical fixed factors. I'd like to do contrasts between the different levels of each factor. Now several online sources tell me I should use 'effects' coding for this, but it isn't clear to me why this is, nor is it clear to me how I should code my contrast matrix when using 'effects' coding rather than reference coding.
For example, if I have a model with an intercept and one categorical fixed factor with three levels, such that:
T = table(y,var1); % y is my response variable, var1 a categorical variable with three levels
formula = 'y ~ var1';
m1 = fitlme(T,formula,'DummyVarCoding', 'reference');
me1 = fitlme(T,formula,'DummyVarCoding', 'effects');
Now I'd like to test whether there's a difference between the first and second categories. In the case of the m1 (reference coding) and me1 (effects coding) I would do this as follows:
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);
This gives me very different results even though in both cases multiplying each contrast matrix with the associated fixed effects matrix gives me the same value (according to the documentation for coefTest: "It tests the null hypothesis that H0: Hβ = 0, where β is the fixed-effects vector."
When I try both options with some simulated data, where the first two levels of the fixed factor differ significantly, I only find this significant difference when using reference coding. Any insight would be greatly appreciated.
Here's the code for some simulated data:
d1 = zeros(100,1)+randn(100,1);
d2 = ones(100,1)+randn(100,1);
d3 = ones(100,1).*2+randn(100,1);
d = [d1;d2;d3];
cov1 = [zeros(100,1);ones(100,1);ones(100,1).*2];
T = table(d,categorical(cov1),'VariableNames',{'d','cov1'});
f = 'd~cov1';
m1 = fitlme(T,f);
me1 = fitlme(T,f,'DummyVarCoding','effects');
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

Accepted Answer

Paul
Paul on 24 Jul 2020
Edited: Paul on 24 Jul 2020
Hi! Yes I solved it.
The problem was that, in the example above, I calculated the H vector incorrectly. This is the vector [0 1 0] which is input to the coefTest() function and which tells the function which contrast to perform.
To compute correct H vectors we need to first look at the model coefficients. For the dummy coded model these are:
(intercept)
Cov_1
Cov_2
And for the effects coded model these are:
(intercept)
Cov_0
Cov_1
To compute the H vector for a contrast, you need to make a vector for each of the two conditions you want to contrast and subtract those vectors from each other. In this case we want to compare condition 0 with condition 1.
For the dummy coded model condition 0 is represented by only the intercept, because in dummy coding the intercept is set equal to the estimate of the first condition, in other words the intercept = Cov_0, so this vector = [1 0 0]. Condition 1 is represented by the intercept + the coefficient Cov_1, so the vector = [1 1 0]. The difference between these vectors: H= [0 -1 0]. You could flip the order of the subtraction around: H = [0 1 0] would be equivalent.
For the effects coded model the intercept does NOT represent condition 0. Instead, condition 0 is now represented by the intercept + the coefficient Cov_0, so this vector = [1 1 0]. The second condition is equal to the intercept + Cov_1, just like in the dummy coded model, but now this vector = [1 0 1]. The difference between these two is H = [0 1 -1] or [0 -1 1].
Using the new vectors both coefTest(m1, [0 1 0]) and coefTest(me1, [0 1 -1]) return the same output.
I personally prefer using dummy coding for doing contrasts, because I find it a bit easier to intuit the H vector, but in principle they are entirely equivalent. When using the anova() function though, you NEED to use the effects coded model to get accurate statistics.
However, you might wonder how to compute contrasts with the third condition, represented by Cov_2, when using effects coding. Cov_2 is not explicitly mentioned in the effects coded model, so how does this work? Well condition 2 is represented by the vector [1 -1 -1], so a contrast between condition 1 and 2 would give H = [1 0 1] - [1 -1 -1] = [0 1 2] when using effects coding. Whereas using dummy coding it would give H = [1 1 0] - [1 0 1] = [0 1 -1].
Another thing I found hard to figure out because the Matlab documentation doesn't discuss it, is how to compare groups of conditions, for instance when you have nested categories. I thought it might be helpful to mention here as well. For example imagine you want to test the efficacy of two diets on weight loss and you have data from both men and women. You would have four categories: weight loss with diet 1 and weight loss with diet 2 for both men and women. You want to study the interaction between gender and type of diet. Your model would be: 'Weight loss ~ DietType * Gender'.
Your dummy coded model would have the following coefficients:
(intercept) < which represents the category Diet_0 + Gender_0
Diet_1
Gender_1
Diet_1*Gender_1
If you want to see whether there's a difference between the first and second diet, the procedure is to add the vectors for each individual category within a group, and then subtract the two group vectors, as follows:
(v1 + v2) - (v3 + v4) = H
Here v1 represents Gender_0 on Diet_0 (just the intercept), v2 represents Gender_1 on Diet_0, v3 represents Gender_0 on Diet_1 and v4 represents Gender_1 on Diet_1. The calculation would be:
([1 0 0 0] + [1 0 1 0]) - ([1 1 0 0]+[1 1 1 1]) = [0 -2 0 -1] = H
If you want to compare efficacy of dieting between men and women, you would say H =
([1 0 0 0] + [1 1 0 0]) - ([1 0 1 0] + [1 1 1 1]) = [0 0 -2 -1]
  4 Comments
LM_BU
LM_BU on 15 Feb 2024
Hi, @Paul, I am not sure how you have arrived at the conclusion that the intercept must be flagged as 1 for each level, i.e., why Cov_0 is [1 1 0].
Based on the MATLAB documentation of fitglme coefTest, where they also use effect coding, they have Supplier B and Supplier C. When you add them together to find the main effect of supplier as [0,0,0,0,1,0;0 0 0 0 0 1], the p value is in agreement with the anova results (i.e., anova(glme)).
If we do [1,0,0,0,1,0;1 0 0 0 0 1], where intercept is also added, the p value is wildly different. To be honest, I'm not sure what this contrast matrix is supposed to be testing.
Can you please clarify? Thanks!
Paul
Paul on 16 Feb 2024
Hi LM_BU,
I don't think I said that "Cov_0 is [1 1 0]". I did say that the first condition is represented by "intercept + Cov_0 is [1 1 0]". In effects coding, every condition contains the intercept; that's part of the definition of the effects coding and as far as I can tell this is also in line with the contents of the documentation page you linked.
I do not see the example you mention (of testing the main effect of supplier) ont he page you linked. Are we both looking at the docs for 2023b?
I do agree with you that when testing for the main effect of supplier, you should leave out the intercept, because you are interested in the effect of supplier, not of the intercept.
In my examples above, I am mostly talking about testing between conditions, which is different from testing main effects. When testing between conditions, you subtract the effects vectors from the two conditions. Since a condition in effects coding always includes the intercept, subtracting effects vectors of any two conditions will always produce an H vector with the intercept element == 0. Example: if you contrast condition 1 [1 0 0] with condition 2 [1 1 0], your H vector will be [1 0 0] - [1 1 0] = [0 -1 0].

Sign in to comment.

More Answers (1)

Xue Zhang
Xue Zhang on 24 Jul 2020
Same question here, have you figured out why? Thanks!

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!