feature reduction via regression analysis

Suppose you have a very large feature vector X, used to predict a vector of expected values y.
Is sequential linear regression,
e.g.: coeff=regress(y, X);
followed by sequential feature reduction,
e.g.: [coeff_subset] = sequentialfs(fun, X, y, 'direction', 'backward');
% where: fun = @(XT,yT,Xt,yt) rmse((regress(yT, XT)'*Xt')', yt);
the easiest/best approach to get a reasonably sized feature vector when no other information is known?
It seems that, from my testing, this method rarely captures the features that matter the most, and I obtained better results by randomly selecting some of the features.
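For reference, a minimal sketch of how the criterion for sequentialfs could be set up, on synthetic data (the data, the variable names and the added intercept column are assumptions, not the original setup). sequentialfs sums the criterion values over the test folds and divides by the number of test observations, so the function here returns a sum of squared errors rather than an RMSE.

```matlab
rng(0);                                      % reproducible synthetic data (assumption)
n = 400; p = 20;
X = randn(n, p);
y = 3*X(:,2) - 2*X(:,7) + 0.1*randn(n, 1);   % only features 2 and 7 matter

% Criterion: sum of squared errors on the held-out fold.
% A column of ones supplies the intercept, since regress does not add one.
fun = @(XT, yT, Xt, yt) ...
    sum((yt - [ones(size(Xt,1),1) Xt] * regress(yT, [ones(size(XT,1),1) XT])).^2);

opts = statset('Display', 'off');
[inmodel, history] = sequentialfs(fun, X, y, ...
    'direction', 'forward', 'cv', 10, 'options', opts);

selected = find(inmodel);    % indices of the original features that were kept
```

inmodel is a logical mask over the original columns, so X(:, inmodel) is the reduced feature matrix.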

10 Comments

How about principal component analysis?
The issue with PCA is that it creates new components, while I want to keep the original features and get rid of what is not necessary. In principle, regression should work (?), but I don't know why it doesn't in this case.
I keep getting a collinearity warning: Columns of X are linearly dependent to within machine precision. Using only the first 1 components to compute TSQUARED
which I believe is related to the poor performance of the regression, but I am not sure how to cope with it.
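One possible way to locate linearly dependent columns, sketched on synthetic data (the matrix and the tolerance rule are assumptions): compare rank(X) with the number of columns, then use a column-pivoted QR factorisation to pick a set of independent original features.

```matlab
rng(1);
X = randn(50, 4);
X(:, 5) = X(:, 1) + 2*X(:, 3);        % column 5 is an exact linear combination

isDeficient = rank(X) < size(X, 2);   % true when columns are linearly dependent

% Column-pivoted QR orders columns by how much new information they add.
[~, R, perm] = qr(X, 0);
d = abs(diag(R));
tol = max(size(X)) * eps(d(1));       % tolerance in the spirit of rank()
r = sum(d > tol);                     % numerical rank
independentCols = sort(perm(1:r));    % original columns to keep
droppedCols = sort(perm(r+1:end));    % columns that add nothing new
```

Which member of a dependent group gets dropped depends on the pivoting, so the physical meaning of the features may be a better guide for that choice.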
I'm with proecsm. Use PCA. Why keep original components when new ones from combinations of existing ones will give you better discrimination? I haven't heard a good answer for that - just that you don't want to for some reason.
By the way, what is "very large" (how many millions of elements), and what is "reasonable sized"? How many features are there? And how many observations were there? For example, you measured 6 things on 1,000,000 samples so you have 1,000,000 feature vectors, each feature vector being 6 elements long.
joeDiHare on 18 Jul 2012
Edited: joeDiHare on 18 Jul 2012
Good points. Let's make it clearer:
**why not new components? Because, despite regression being fairly "blind", I chose the features based on their physical meaning, and I am interested in seeing whether these specific ones (rather than new components made of them) can capture the expected values.
**size: I have roughly 4k observations and 200 features. (Of the 4k observations, 400 are independent, while the rest are the first 400 processed via 10 different conditions each.)
One of these features has a 0.94 correlation with the expected values but is on a different scale, so the RMSE is high; out of the remaining 199, I only want to select one or two so that the overall RMSE and correlation improve.
I tried with regress, sequentialfs and stepwisefit (as suggested by llya).
It sounds reasonable but I guess you could think of theoretically possible counterexamples, such as you have measurements A, B, and C. A and B correlate highly with the expected value, and C not so much. So you might take A and B and throw away C. BUT, what if A and B are highly correlated with each other, or totally redundant (say B = 0.5 * A)? Then keeping B isn't really gaining anything for you (no additional predictive power) and you'd probably be better off keeping C instead of B. However I'm no expert on this kind of thing so that's when I go asking our company's brilliant statisticians.
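The counterexample can be sketched numerically (synthetic data and coefficients are assumptions chosen for illustration): B = 0.5 * A correlates with the response exactly as well as A does, yet only the weakly correlated C reduces the residual error once A is in the model.

```matlab
rng(2);
n = 200;
A = randn(n, 1);
B = 0.5 * A;                       % totally redundant with A
C = randn(n, 1);                   % independent of A
y = A + 0.3*C + 0.1*randn(n, 1);

corrWithY = corr([A B C], y);      % A and B correlate equally with y

% Residual sum of squares of a least-squares fit with an intercept.
sse = @(Z) sum((y - [ones(n,1) Z] * ([ones(n,1) Z] \ y)).^2);

sseA  = sse(A);                    % A alone
sseAC = sse([A C]);                % A plus the weakly correlated C
```

Because B is an exact multiple of A, a model with A and B spans the same space as A alone, so its residual matches sseA.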
I don't know what you mean by "Corr-->0" for 10 features. Two models, one of which is a subset of the other, can be compared by an F test. This is what the stepwise procedure uses to select predictors. If you use the same data for selecting predictors and testing the equivalence of the full and reduced models, your p-value will be optimistically high. Since you have plenty of data, you can use say 2/3 for selecting predictors and the rest 1/3 to compute the p-value for the F statistic.
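A rough sketch of that split, under assumptions (synthetic data, default stepwisefit thresholds, and both models refitted on the held-out third before computing the F statistic):

```matlab
rng(3);
n = 900; p = 10;
X = randn(n, p);
y = 2*X(:,1) - X(:,4) + 0.5*randn(n, 1);

idx = randperm(n);
nSel = round(2*n/3);
sel = idx(1:nSel);  tst = idx(nSel+1:end);

% Select predictors on the first 2/3 of the data.
[~, ~, ~, inmodel] = stepwisefit(X(sel,:), y(sel), 'display', 'off');

% Compare full and reduced models on the remaining 1/3 with an F test.
nT = numel(tst);
Xf = [ones(nT,1) X(tst,:)];             % full model
Xr = [ones(nT,1) X(tst, inmodel)];      % reduced model
sseF = sum((y(tst) - Xf*(Xf\y(tst))).^2);
sseR = sum((y(tst) - Xr*(Xr\y(tst))).^2);
dfNum = size(Xf,2) - size(Xr,2);        % number of dropped predictors
dfDen = nT - size(Xf,2);
F = ((sseR - sseF)/dfNum) / (sseF/dfDen);
pValue = 1 - fcdf(F, dfNum, dfDen);     % large p: no evidence the dropped terms matter
```

The sketch assumes at least one predictor is dropped; if the selection keeps all of them, dfNum is zero and the F statistic is undefined.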
It is possible you need (much) more than 10 features to avoid losing the predictive power. If your goal is to select just a few features, forward selection might be a better choice.
If you have 2011b, you should also try the lasso function.
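A minimal lasso sketch on synthetic data (the data and the choice of the one-standard-error fit are assumptions): the nonzero coefficients at the chosen regularisation level identify which original features are kept.

```matlab
rng(4);
n = 300; p = 30;
X = randn(n, p);
y = 1.5*X(:,3) - 2*X(:,11) + 0.2*randn(n, 1);

[B, FitInfo] = lasso(X, y, 'CV', 10);   % 10-fold cross-validated lasso path
idx = FitInfo.Index1SE;                 % sparsest fit within 1 SE of the minimum MSE
selected = find(B(:, idx) ~= 0);        % original features with nonzero weight
```

Each column of B corresponds to one value of the regularisation parameter in FitInfo.Lambda, so other columns give larger or smaller feature subsets.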
Yes, you're right, forward selection does work better (and it is quicker). By Corr-->0 I meant that the correlation decreases, but I got around it by making sure that the feature with 0.94 correlation is always selected (with the 'keep' option). All in all, the remaining 199 do not really help, but I guess that is not the regression's fault. I will try PCA, and will definitely use separate training and testing sets. Thanks a lot.
joeDiHare on 18 Jul 2012
Edited: joeDiHare on 18 Jul 2012
Two last things.
1. Interestingly, the results were completely ruined by one feature vector being close to Inf for all values. After setting it to zero, the correlations went from near zero back up again. Why is that? If I zscore it, it works somewhat, but then it messes things up for the non-normalised test values.
2. I noticed that if you re-run stepwisefit, you can get some more features out. Is there an iterative stepwisefit?
I have trouble interpreting what you wrote in 1 because I still don't know what you mean by correlation. I thought you were saying that the correlation between each individual predictor and the observed response (measured y values) was small for all predictors but one. But setting one predictor to zero cannot have any effect on correlations between the other predictors and the response. And so "correlations went from near zero to back up again" is a mystery to me. Then perhaps by "correlation" you mean correlation between the predicted response and observed response? I don't get how that can be zero after you added the predictor with 94% correlation to the model either. If that happens, something must've gone bad with the fit.
Instead of re-running stepwisefit, I would recommend playing with 'penter' and 'premove' parameters.
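A sketch of what that could look like (synthetic data and threshold values are assumptions): 'penter' and 'premove' set the p-value thresholds for adding and removing terms, while 'keep' holds a term in its initial state, so a feature that must stay in also needs to be switched on through 'inmodel'.

```matlab
rng(5);
n = 400; p = 15;
X = randn(n, p);
y = 4*X(:,1) + 0.3*X(:,6) + 0.5*randn(n, 1);

start = false(1, p);  start(1) = true;   % feature 1 starts in the model
keep  = false(1, p);  keep(1)  = true;   % ...and is held in that state

[b, se, pval, inmodel] = stepwisefit(X, y, ...
    'inmodel', start, 'keep', keep, ...
    'penter', 0.01, 'premove', 0.05, ... % stricter than the 0.05/0.10 defaults
    'display', 'off');

selected = find(inmodel);                % indices of the features in the final model
```

Raising 'penter' lets more features in; lowering it gives a smaller model.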
Thanks, I will tweak p values to get what I need.
About 1, yes, it is strange to me, but correlation between predicted and observed responses goes to zero because there is one feature that has wild values.
I don't know why it happens, but by setting the bad feature (e.g. feat #71) to 0, corr goes to 94% again (and a bit higher).
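A likely explanation (an assumption, not confirmed in this thread) is that a column with enormous values makes the least-squares problem badly conditioned, so the other coefficients lose precision. A sketch of one way to screen such columns and to standardise the test set with the training statistics, so that both sets share the same scale (the data and the 1e6 threshold are arbitrary assumptions):

```matlab
rng(6);
Xtrain = randn(100, 5);  Xtest = randn(40, 5);
Xtrain(:, 3) = 1e300 * (1 + rand(100, 1));   % a feature with wild values
Xtest(:, 3)  = 1e300 * (1 + rand(40, 1));

% Flag columns that are non-finite or enormous compared with the rest.
scale = max(abs(Xtrain), [], 1);
bad = ~all(isfinite(Xtrain), 1) | scale > 1e6 * median(scale);
Xtrain(:, bad) = [];  Xtest(:, bad) = [];

% Standardise with the TRAINING mean and std, then reuse them on the test set.
[Ztrain, mu, sigma] = zscore(Xtrain);
Ztest = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), sigma);
```

The same mu and sigma would also be applied to any later data passed to the fitted model.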


 Accepted Answer

If you prefer linear regression, use function stepwisefit or its new incarnation LinearModel.stepwise. For example, for backward elimination with an intercept term you can do
load carsmall
X = [Acceleration Cylinders Displacement Horsepower];
y = MPG;
% stepwisefit adds the intercept itself, so X carries no column of ones
stepwisefit(X,y,'inmodel',true(1,4))
In general, there is no "best" approach to feature selection. What you can do depends on what assumptions you are willing to make (such as linear model), how many features you have and how much effort you want to invest.


Asked:

on 16 Jul 2012
