Regression with tall array (Using datastore, CSV) - Error

Hi
  5 Comments
K.P. on 12 Jul 2021
x is a 1000x500 (tall) table. These are the first entries:
7 6 12 12 15 13 12 30 71 6
3 4 4 0 0 1 10 2 6 1
1 0 0 0 0 0 2 0 0 0
1 0 4 0 0 0 0 0 4 0
6 3 5 2 0 0 10 0 3 0
3 26 10 3 0 2 15 7 24 1
17 85 5 4 0 0 29 0 6 0
1 0 1 0 0 2 1 0 0 0
2 0 3 0 0 0 9 0 4 0
5 18 11 2 0 1 6 0 3 0
3 1 0 0 0 2 4 0 0 0
2 0 0 0 0 0 0 0 0 0
2 0 10 0 0 0 0 0 0 0
2 0 1 1 0 3 0 0 3 0
2 16 3 0 0 0 3 2 36 1
y is a 1000x1 (tall) table and the first entries are:
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
dpb on 12 Jul 2021
I just tried this to see whether the problem was the combination of tall arrays and fitglm:
>> X=[1:1000].'; X=tall(X);
>> Y=randn(size(X)); % this is interesting sidelight on the way...
Error using randn
Size inputs must be numeric.
>> size(X)
ans =
1×2 tall double row vector
1000 1
>> Y=randn(1000,1); Y=tall(Y); % OK, have to brute-force it
>> fitglm(X,Y,'Distribution',"normal")
Iteration [1]: 0% completed
Iteration [1]: 50% completed
Iteration [1]: 100% completed
Iteration [2]: 0% completed
Iteration [2]: 50% completed
Iteration [2]: 100% completed
Iteration [3]: 0% completed
Iteration [3]: 100% completed
ans =
Compact generalized linear regression model:
y ~ 1 + x1
Distribution = Normal
Estimated Coefficients:
                    Estimate          SE          tStat      pValue
                   __________    __________    ________    _______
    (Intercept)     0.0015036      0.064429    0.023338    0.98139
    x1             1.6177e-05    0.00011151     0.14507    0.88468
1000 observations, 998 error degrees of freedom
Estimated Dispersion: 1.04
F-statistic vs. constant model: 0.021, p-value = 0.885
>>
So fitglm does accept tall arrays; the problem must lie elsewhere, in the call syntax it would seem...


Accepted Answer

Ive J on 13 Jul 2021
Edited: Ive J on 13 Jul 2021
Well, your data is a tall table, and that's what MATLAB complains about: since your first argument is a table, MATLAB interprets the second argument, y, as modelspec. You have two options:
% Option 1: pass numeric matrices to fitglm (brace indexing extracts the table contents)
mdl = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');
% Option 2: merge x and y into a single table
data = [x, y]; % the last column is the response variable by default
mdl = fitglm(data, 'Link', 'logit', 'Distribution', 'binomial');
Btw, your data is fairly small and (I assume) fits in memory, so tall arrays should be avoided for a dataset of this size.
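To make the two options concrete, here is a minimal sketch on in-memory (not tall) tables. The data is synthetic and the variable names a, b, c and resp are made up for illustration; they are not from the original file. It assumes the Statistics and Machine Learning Toolbox is available.

```matlab
% Synthetic stand-in data (hypothetical, not the poster's file)
rng(0);                                        % reproducible random numbers
X = randi([0 20], 200, 3);                     % count-like predictors
p = 1 ./ (1 + exp(-(0.2*X(:, 1) - 2)));        % success probability used to simulate Y
Y = double(rand(200, 1) < p);                  % binary response (0/1)

x = array2table(X, 'VariableNames', {'a', 'b', 'c'});
y = array2table(Y, 'VariableNames', {'resp'});

% Option 1: extract numeric matrices from the tables
mdl1 = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');

% Option 2: one table, with the response as the last variable
mdl2 = fitglm([x, y], 'Link', 'logit', 'Distribution', 'binomial');

% Largest difference between the two sets of coefficient estimates
coefDiff = max(abs(mdl1.Coefficients.Estimate - mdl2.Coefficients.Estimate));
disp(coefDiff);
```

Both calls fit the same model, so the estimates should agree up to numerical noise; only the variable names in the output differ (x1, x2, x3 versus a, b, c).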
  2 Comments
K.P. on 13 Jul 2021
Hi Ive,
I merged the x and y tables and converted the merged table to a numeric array before building the tall array, using:
ds = transform(ds,@table2array);
Now it works. Thanks for your help!
PS: the file here was only a smaller sample. The "real" one is 320000x30000.
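For anyone landing here later, the workflow described above might look roughly like the sketch below. It is an assumption about the setup, not the exact code used: the file name glm_sample.csv, the column names and the synthetic data are invented so the example is self-contained, and the response is assumed to sit in the last (4th) column.

```matlab
% Write a small hypothetical CSV so the sketch runs on its own
rng(1);
X = randi([0 20], 300, 3);
Y = double(rand(300, 1) < 1 ./ (1 + exp(-(0.2*X(:, 1) - 2))));
csvFile = fullfile(tempdir, 'glm_sample.csv');
writetable(array2table([X, Y], 'VariableNames', {'a', 'b', 'c', 'resp'}), csvFile);

ds = tabularTextDatastore(csvFile);   % each read returns a table
ds = transform(ds, @table2array);     % each read now returns a numeric matrix
tt = tall(ds);                        % tall numeric array built from the transformed datastore

% Predictors are columns 1 to 3, the response is column 4 (assumed layout)
mdl = fitglm(tt(:, 1:3), tt(:, 4), 'Link', 'logit', 'Distribution', 'binomial');
disp(mdl.Coefficients);
```

Because the inputs are tall numeric arrays rather than tall tables, fitglm sees the usual (X, y) call and no longer mistakes the second argument for a model specification.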
Ive J on 13 Jul 2021
If I were you I would also test with arrays. Processing tables is almost always (based on my experience) slower than arrays.
Good luck!
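A quick way to check the table-versus-array difference on your own machine is a small timing comparison like the sketch below. The data is synthetic and in memory, and the sizes are arbitrary, so treat the result as a rough indication only; the gap may look quite different on the real file.

```matlab
% Hypothetical in-memory data for a rough timing comparison
rng(2);
X = randi([0 20], 500, 5);
Y = double(rand(500, 1) < 1 ./ (1 + exp(-(0.2*X(:, 1) - 2))));
T = array2table([X, Y]);              % default variable names, response in the last column

fitArray = @() fitglm(X, Y, 'Distribution', 'binomial');   % numeric inputs
fitTable = @() fitglm(T, 'Distribution', 'binomial');      % table input

tArray = timeit(fitArray);            % typical run time in seconds
tTable = timeit(fitTable);
fprintf('array: %.4f s, table: %.4f s\n', tArray, tTable);
```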

