Help assess my Random Forest work and work on feature selection

Hi,
Basically, I'm doing tree species classification from hyperspectral data. We have 114 features, which represent spectral bands at different wavelengths, and 6596 observations, each representing one pixel of a tree in the image. The end goal is to train a Random Forest to classify the species of each tree pixel, but for now I'm just working on distinguishing between broadleaf and conifer trees. I'm stuck on feature selection, and I'm unsure what the best overall approach to Random Forest in MATLAB is, as I've only ever coded in Python before and this is my first step into machine learning.
Here is the basis of the code I am using:
features=X;
classLabels=y_cat;
%%Split Data Into Training and Testing Set
%holdout cross validation method, holding out 33% of data to cross validate
%with
rng(1);
cvp=cvpartition(length(X),'holdout',0.33);
%Training set
Xtrain=X(training(cvp),:);
Ytrain=y_cat(training(cvp),:);
%Testing set
Xtest=X(test(cvp),:);
Ytest=y_cat(test(cvp),:);
%%Initial Training of the Ensemble
opts=statset('UseParallel',true);
rng('default');
numTrees=1000;
A=TreeBagger(numTrees,Xtrain,Ytrain,'method','classification','Options',opts,...
'OOBVarImp','On','OOBPredictorImportance','on');
%%Running Cross Validation of Ensemble
classA=@(Xtrain,Ytrain,Xtest)(predict(A,Xtest));
%Calculating Mean Misclassification Rate over a 10 fold, randomized repeat of
%classification
Miss_Class_Rate_Initial=crossval('mcr',X,y_cat,'predfun',classA,'partition',cvp);
Basically, with feature selection, I've tried several different processes that haven't worked out, mainly, I think, due to the nature of my data. The hyperspectral data goes from the visible wavelengths all the way up to the infrared wavelengths. This means that two features next to each other may both be very important for tree species classification, but since they're bands at similar wavelengths (meaning similar colors), they're very highly correlated. For example, feature 69 and feature 70, which are two of the "best" features based on OOBPredictorImportance, have a correlation of 99.79%.
PCA, sequentialfs forward and sequentialfs backward all gave me the same, if not lower, accuracy than when I run all 114 features. And every time I've run them, I've done a 1:1:114 loop adding in the next best ranked feature from the algorithm each time, plotting the loss function for each, and it still has unsatisfactory results. If anybody could lead me in the proper direction for feature selection, especially anybody that has worked with hyperspectral data before, that would be great.
Also, if anybody could look at the raw version of my code that I put up and tell me if there's anything else I could change to increase the accuracy, that would be great. I've researched these forums for weeks, but I feel that my issues are specific to this type of data, which is why I'm starting my own thread.
Thank you in advance for any and all help/advice.

Answers (1)

Ilya on 7 Jul 2017
First, you would increase your chance of getting a useful reply if you simplified the problem. Your code and your question are really convoluted. Split the problem into smaller steps. Make clear statements and ask clear questions about each step.
You can address feature selection without optimizing TreeBagger parameters. The default TreeBagger parameters are pretty good for classification accuracy. Use the default parameters to find the feature selection method that suits your needs.
Second, solving a problem starts with clearly posing the problem. You say "PCA failed, sequentialfs forward and backward failed". Why do you think they failed? You say "That's because most feature selection algorithms wanted me to take out either all of my visible light data, or all of my infrared data, which is not viable." Why is this not viable? After all of your, say, infrared data are taken out, do you get the same classification accuracy as with all features included? If yes, what is the extra criterion you use to determine that this is not viable? If no, this implies that sequentialfs was not run correctly or perhaps its results were not interpreted correctly.
  2 Comments
Anthony on 7 Jul 2017
Hi,
Thank you for your response. I clarified it all above as best as possible, let me know if you think I should make further edits.
-Anthony
Ilya on 9 Jul 2017
Your code does not do what you think it does. Your classA function always returns predictions from the same TreeBagger model in A, including for observations that were used to train model A, so your estimate of accuracy is way too optimistic. To cross-validate correctly, you need to make training part of the classA function. Your classA function should look like this:
classA=@(Xtrain,Ytrain,Xtest) predict(TreeBagger(numTrees,Xtrain,Ytrain),Xtest);
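As a rough, self-contained sketch of how that function might be used end to end, the snippet below runs 5-fold cross-validation on synthetic stand-in data. The data, the class names, the variable name y, the tree count, and the choice of 5 folds are all assumptions made for illustration; they are not from the original post. Labels are stored as a cell array of character vectors here so that they match the type predict returns for a classification TreeBagger.

```matlab
% Synthetic stand-in data (assumption): 200 observations, 10 features, 2 classes.
rng(1);
X = randn(200,10);
labels = {'broadleaf';'conifer'};
y = labels((X(:,1) + 0.5*randn(200,1) > 0) + 1);   % 200-by-1 cell array of char

numTrees = 50;   % kept small so the sketch runs quickly

% Training happens inside the function, so every fold fits its own forest
% on that fold's training rows only.
classA = @(Xtrain,Ytrain,Xtest) predict(TreeBagger(numTrees,Xtrain,Ytrain),Xtest);

% Stratified 5-fold partition built from the labels.
cvp = cvpartition(y,'KFold',5);

% Mean misclassification rate over the folds.
mcr = crossval('mcr',X,y,'Predfun',classA,'Partition',cvp);
fprintf('Cross-validated misclassification rate: %.3f\n',mcr);
```

With a holdout partition instead of k-fold, the same call would train once on the training rows and score once on the held-out rows.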
For TreeBagger, you do not need to cross-validate. You can use OOB estimates to measure generalization error.
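A minimal sketch of the out-of-bag approach, again on synthetic stand-in data (the data, class names, and tree count are assumptions for illustration). Each tree is scored only on the observations left out of its bootstrap sample, so no separate test split is needed.

```matlab
% Synthetic stand-in data (assumption): 300 observations, 10 features, 2 classes.
rng(1);
X = randn(300,10);
labels = {'broadleaf';'conifer'};
y = labels((X(:,1) + 0.5*randn(300,1) > 0) + 1);

% 'OOBPrediction','on' stores which observations are out of bag for each tree.
B = TreeBagger(100,X,y,'Method','classification','OOBPrediction','on');

% By default oobError returns the cumulative error: element k is the
% out-of-bag misclassification rate using trees 1 through k.
oobErr = oobError(B);
fprintf('OOB misclassification rate with %d trees: %.3f\n',numel(oobErr),oobErr(end));
```

Plotting oobErr against the number of trees is a common way to check whether the forest has enough trees for the error to level off.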
Regarding feature selection: you say "PCA, sequentialfs forward and sequentialfs backward all gave me the same, if not lower, accuracy than when I run all 114 features." Selecting a subset of features is unlikely to increase accuracy. Random forest usually obtains the highest accuracy with all features included. A more realistic question would be: How many features can I discard without losing much accuracy?
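One possible way to explore that question is sketched below: rank the features by out-of-bag permutation importance, then retrain on only the top k features for a few values of k and compare the out-of-bag error. This is an illustration under assumptions, not a recommended procedure from the thread. The synthetic data, the values of k, and the tree count are made up, and because the ranking is computed on the same data used to score the subsets, the errors for small k may be somewhat optimistic.

```matlab
% Synthetic stand-in data (assumption): only features 1 and 2 carry signal.
rng(1);
n = 300; p = 12;
X = randn(n,p);
labels = {'broadleaf';'conifer'};
y = labels((X(:,1) + X(:,2) + 0.5*randn(n,1) > 0) + 1);

% Forest on all features, with OOB permutation importance enabled.
full = TreeBagger(100,X,y,'OOBPrediction','on','OOBPredictorImportance','on');
[~,order] = sort(full.OOBPermutedPredictorDeltaError,'descend');

% Retrain on the k highest-ranked features and record the final OOB error.
ks = [2 4 8 p];
err = zeros(size(ks));
for i = 1:numel(ks)
    Bk = TreeBagger(100,X(:,order(1:ks(i))),y,'OOBPrediction','on');
    e = oobError(Bk);
    err(i) = e(end);
end

disp(table(ks(:),err(:),'VariableNames',{'NumFeatures','OOBError'}));
```

The smallest k whose error stays close to the all-features error is a candidate answer to "how many features can I discard".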
