Fixed Effects Design Matrix Must be of full column rank with multiple categorical predictors

I am probably doing something very dumb, however I cannot figure out my mistake.
I am trying to regress out some predictors from a data set -- I have two categorical predictors, A1 and A2 in a table, something like this:
It seems obvious to me that A1 and A2 are linearly independent. They are also linearly independent from the intercept, which I believe should be a categorical variable that looks like ones(1,11) ? But regardless, I want the global mean to not be removed from everything, so I don't include an intercept in the model.
Then, if I run something like this:
lme = fitlme(tbl, 'values ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % tbl (name assumed) is the table holding values, A1 and A2
I always get the same error :
Error using classreg.regr.lmeutils.StandardLinearLikeMixedModel/validateInputs (line 229)
Fixed Effects design matrix X must be of full column rank.
I don't understand why this is happening -- and probably this shows that I have a pretty big misunderstanding of what the dummy variables actually are.
However, if I run two fitlme's -- one on the subset A1==1 and one on A1==0, they both work, which just super confuses me.

Answers (1)

Ive J on 29 Jan 2022
The error is self-explanatory: the cause is the full dummy variable scheme you're using (why?).
Note that the error has nothing to do with mixed-model design. Consider this example:
n = 100; % sample size
tab = table(randn(n,1), categorical(randi([0 1], n, 1)), ...
categorical(randi([0, 1], n, 1)),...
'VariableNames', {'value', 'A1', 'A2'});
mdl1 = fitlm(tab, 'value ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % design matrix is rank deficient
Warning: Regression design matrix is rank deficient to within machine precision.
mdl1 =
Linear regression model:
    value ~ A1 + A2

Estimated Coefficients:
            Estimate       SE         tStat      pValue
            _________    _______    ________    _______
    A1_0     -0.20234    0.20399    -0.99191    0.32373
    A1_1            0          0         NaN        NaN
    A2_0    -0.045804    0.17202    -0.26627     0.7906
    A2_1     0.097693    0.18145     0.53839    0.59155

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 1.02
R-squared: 0.0145, Adjusted R-Squared: -0.00585
F-statistic vs. constant model: 0.712, p-value = 0.493
So, what happened? Let's construct the design matrix:
X = [dummyvar(tab.A1), dummyvar(tab.A2)]; % DummyVarCoding -> full
disp(rank(X)) % 3 < size(X, 2) --> 3 < 4 --> rank deficient
% what about when considering them alone?
disp(rank(X(:, 1:2))) % full rank
disp(rank(X(:, 3:4))) % full rank
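The rank of 3 can be traced to a single linear dependency among the four columns. As a short derivation, write the columns of X in the order used above as d_{A1=0}, d_{A1=1}, d_{A2=0}, d_{A2=1} (notation introduced here for illustration) and assume every observation belongs to exactly one level of each variable:

$$
\mathbf{d}_{A1=0} + \mathbf{d}_{A1=1} = \mathbf{1} = \mathbf{d}_{A2=0} + \mathbf{d}_{A2=1}
\quad\Longrightarrow\quad
X \begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix} = \mathbf{0}
\quad\Longrightarrow\quad
\operatorname{rank}(X) \le 3 .
$$

Each pair of columns on its own has no such relation, which is why the two 2-column blocks are full rank while the combined 4-column matrix is not.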
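We can roughly locate a redundant column from the QR factorization. The dependency involves all four columns, so dropping any one of them would restore full rank; unpivoted QR simply flags the last one: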
[~, R] = qr(X, 0);
find(abs(diag(R)) < 1e-6)
ans = 4
Therefore, don't set 'DummyVarCoding' to 'full' in such cases (the default is 'reference').
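As a minimal sketch of the same point, the dependency can be checked by hand using only base MATLAB functions. The data below are a small made-up example (not the asker's table), and the indicator columns are built manually rather than with dummyvar so that no toolbox is assumed:

```matlab
% Small deterministic example (hypothetical data, chosen for illustration).
a1 = [0 0 1 1 0 1]';
a2 = [0 1 0 1 1 0]';

% Full dummy coding built by hand: one indicator column per level.
Xfull = double([a1 == 0, a1 == 1, a2 == 0, a2 == 1]);

% Each pair of indicator columns sums to a column of ones, so
% col1 + col2 - col3 - col4 = 0 and the four columns are dependent.
depNorm = norm(Xfull * [1; 1; -1; -1]);

% Rank of the full-coded matrix (one less than the number of columns here).
rankFull = rank(Xfull);

% Dropping any single column removes the dependency.
rankAfterDrop = zeros(1, 4);
for k = 1:4
    Xk = Xfull;
    Xk(:, k) = [];
    rankAfterDrop(k) = rank(Xk);
end

fprintf('rank(Xfull) = %d of %d columns\n', rankFull, size(Xfull, 2));
fprintf('rank after dropping column k: %s\n', mat2str(rankAfterDrop));
```

For this example every 3-column submatrix has rank 3, so which column gets dropped is a matter of parameterization, not of one specific variable being "bad".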
