Feature reduction via regression analysis
Suppose you have a very large feature vector X, used to predict a vector of expected values y.
Is sequential linear regression,
e.g.: coeff = regress(y, X);
followed by sequential feature reduction,
e.g.: [coeff_subset] = sequentialfs(fun, X, y, 'direction', 'backward');
% where fun returns the criterion to minimize, e.g. the sum of squared errors on the test fold:
% fun = @(XT, yT, Xt, yt) sum((yt - Xt*regress(yT, XT)).^2);
the easiest/best approach to get a reasonably sized feature vector when no other information is known?
It seems, from my testing, that this method rarely captures the features that matter most; I obtained better results by randomly selecting some of the features.
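For readers who want to experiment outside MATLAB, here is a rough Python/NumPy sketch of the same idea as sequentialfs with 'direction','backward': greedily drop the feature whose removal least hurts the held-out squared error. The function name backward_select and the 2/3–1/3 split are illustrative choices, not part of the original question.

```python
import numpy as np

def backward_select(X, y, n_keep, seed=0):
    """Greedy backward elimination (a rough analogue of MATLAB's
    sequentialfs with 'direction','backward'): repeatedly drop the
    feature whose removal least increases held-out squared error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    tr, te = idx[: 2 * n // 3], idx[2 * n // 3:]   # train / test split

    def sse(cols):
        # Fit on the training rows, score squared error on the test rows.
        b, *_ = np.linalg.lstsq(X[np.ix_(tr, cols)], y[tr], rcond=None)
        r = y[te] - X[np.ix_(te, cols)] @ b
        return float(r @ r)

    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        # Drop the feature whose removal gives the smallest test SSE.
        trials = [(sse([c for c in kept if c != d]), d) for d in kept]
        kept.remove(min(trials)[1])
    return kept

# Toy data: y depends only on features 0 and 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)
print(sorted(backward_select(X, y, 2)))
```

On this toy problem the informative features survive elimination; as noted above, on real data with many correlated features the greedy path can easily miss the best subset.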
10 Comments
bym
on 16 Jul 2012
How about principal component analysis?
joeDiHare
on 17 Jul 2012
Image Analyst
on 17 Jul 2012
I'm with proecsm. Use PCA. Why keep original components when new ones from combinations of existing ones will give you better discrimination? I haven't heard a good answer for that - just that you don't want to for some reason.
By the way, what is "very large" (how many millions of elements), and what is "reasonable sized"? How many features are there? And how many observations were there? For example, you measured 6 things on 1,000,000 samples so you have 1,000,000 feature vectors, each feature vector being 6 elements long.
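To make the PCA suggestion concrete, here is a minimal sketch in Python/NumPy (not MATLAB's pca/princomp, just the same computation via SVD): center the data, take the top-k right singular vectors, and project. The function name pca_reduce is illustrative.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X onto its top-k principal components (via SVD),
    as an alternative to selecting a subset of the raw features."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]       # scores and component loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Z, comps = pca_reduce(X, 2)
print(Z.shape)
```

The trade-off raised in this thread still applies: the k new components usually capture more variance than any k raw features, but each component mixes all original measurements, so you lose the interpretability of keeping a named subset.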
Image Analyst
on 18 Jul 2012
It sounds reasonable, but I guess you could think of theoretically possible counterexamples. Say you have measurements A, B, and C: A and B correlate highly with the expected value, and C not so much, so you might keep A and B and throw away C. But what if A and B are highly correlated with each other, or totally redundant (say B = 0.5 * A)? Then keeping B isn't really gaining you anything (no additional predictive power), and you'd probably be better off keeping C instead of B. However, I'm no expert on this kind of thing, so that's when I go asking our company's brilliant statisticians.
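The redundancy scenario above is easy to verify numerically. This Python/NumPy sketch (the setup with coefficients 1 and 0.3 is invented for illustration) builds B = 0.5 * A and compares the fit of {A, B} against {A, C}: the redundant pair explains no more variance than A alone, while the weakly correlated C still adds predictive power.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=500)
B = 0.5 * A                       # perfectly redundant with A
C = rng.normal(size=500)
y = A + 0.3 * C + 0.05 * rng.normal(size=500)

def r2(cols):
    """Coefficient of determination for a least-squares fit on cols."""
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(round(r2([A, B]), 3), round(r2([A, C]), 3))
```

Fitting {A, B} gives exactly the same residuals as fitting A alone (the two columns span the same space), whereas swapping in C raises the R².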
Ilya
on 18 Jul 2012
I don't know what you mean by "Corr-->0" for 10 features. Two models, one of which is a subset of the other, can be compared by an F test. This is what the stepwise procedure uses to select predictors. If you use the same data for selecting predictors and testing the equivalence of the full and reduced models, your p-value will be optimistically high. Since you have plenty of data, you can use, say, 2/3 for selecting predictors and the remaining 1/3 to compute the p-value for the F statistic.
It is possible you need (much) more than 10 features to avoid losing the predictive power. If your goal is to select just a few features, forward selection might be a better choice.
If you have R2011b, you should also try the lasso function.
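For the non-MATLAB reader, the lasso mentioned here shrinks coefficients with an L1 penalty, driving the uninformative ones exactly to zero, so it does selection and fitting in one step. Below is a plain coordinate-descent sketch in Python/NumPy (the function name lasso_cd, the penalty lam=0.1, and the toy data are all illustrative assumptions, not MATLAB's implementation).

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso: minimizes
    ||y - X b||^2 / (2n) + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual
            rho = X[:, j] @ r / n
            # Soft-threshold: small correlations are zeroed out.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_sq[j]
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=200)
b = lasso_cd(X, y, lam=0.1)
print(np.nonzero(np.abs(b) > 1e-6)[0])
```

With the penalty larger than the noise-level correlations, only the two truly informative coefficients remain nonzero, which is exactly the selection behavior suggested above.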
joeDiHare
on 18 Jul 2012
Ilya
on 18 Jul 2012
I have trouble interpreting what you wrote in 1 because I still don't know what you mean by correlation. I thought you were saying that the correlation between each individual predictor and the observed response (measured y values) was small for all predictors but one. But setting one predictor to zero cannot have any effect on correlations between the other predictors and the response. And so "correlations went from near zero to back up again" is a mystery to me. Then perhaps by "correlation" you mean correlation between the predicted response and observed response? I don't get how that can be zero after you added the predictor with 94% correlation to the model either. If that happens, something must've gone bad with the fit.
Instead of re-running stepwisefit, I would recommend playing with 'penter' and 'premove' parameters.
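To illustrate what the 'penter' and 'premove' thresholds control, here is a toy Python/NumPy sketch in the spirit of stepwisefit (not its actual algorithm): add the candidate with the smallest p-value below penter, and drop any kept feature whose p-value rises above premove. The p-values here use a normal approximation to the t statistic via math.erfc, which is only reasonable for large samples; the function name stepwise_forward and the toy data are invented for illustration.

```python
import math
import numpy as np

def stepwise_forward(X, y, penter=0.05, premove=0.10):
    """Toy stepwise selection: enter features with p < penter,
    remove kept features with p > premove."""
    n, p = X.shape
    kept = []

    def pvals(cols):
        # Two-sided p-values for each coefficient, normal approximation.
        Xs = X[:, cols]
        b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ b
        s2 = resid @ resid / (n - len(cols))
        cov = s2 * np.linalg.inv(Xs.T @ Xs)
        t = b / np.sqrt(np.diag(cov))
        return [math.erfc(abs(tj) / math.sqrt(2)) for tj in t]

    changed = True
    while changed:
        changed = False
        best = None
        for j in [c for c in range(p) if c not in kept]:
            pv = pvals(kept + [j])[-1]
            if pv < penter and (best is None or pv < best[0]):
                best = (pv, j)
        if best:
            kept.append(best[1])
            changed = True
        if kept:
            pv = pvals(kept)
            worst = max(range(len(kept)), key=lambda i: pv[i])
            if pv[worst] > premove:
                kept.pop(worst)
                changed = True
    return sorted(kept)

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = 1.5 * X[:, 2] + X[:, 5] + 0.2 * rng.normal(size=300)
print(stepwise_forward(X, y))
```

Tightening penter makes entry harder (fewer, more reliable features); loosening premove makes the procedure more reluctant to evict a feature once it is in, which is the tuning knob Ilya is pointing at.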
joeDiHare
on 18 Jul 2012