Why is the 'CategoricalPredictors' property always empty after running fitrensemble()?

Hi. I am running a regression tree ensemble, and despite specifying a categorical variable via the 'CategoricalPredictors' name-value pair, the resulting Mdl object looks as if it has not used the categorical variable.
Suspecting that the default value (1/3) of 'NumVariablesToSample' might be the reason the categorical feature is excluded, I set 'NumVariablesToSample' to 'all' to make sure that every variable is available to every tree.
%Some data: 20 continuous predictors plus one 5-level categorical predictor
T = 1000; N = 20;
X = randn(T,N);
b = zeros(N+1,1); b(2:6) = 2; b(7:11) = -3;
D = repmat([1 2 3 4 5],1,200)'; %Categorical variable (5 levels)
dum = dummyvar(D);
theta = (1:5)'/5^2;
Y = [ones(T,1) X]*b + dum*theta + randn(T,1)*0.1;
N = size([X D],2); %N is now 21, the total number of predictors
clear dum theta b
%Bagged Ensemble
Ntrees = 500;
tr = templateTree('MinParentSize',250,'CategoricalPredictors',21,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','Bag','Learners',tr,'NumLearningCycles',Ntrees);
rsvm = false(N,Ntrees); %rsvm(j,i) is true if predictor j is used for a split in tree i
for i = 1:Ntrees
    idx = unique(Mdl.Trained{i}.CutPredictorIndex);
    idx(idx==0) = []; %0 marks leaf nodes
    rsvm(idx,i) = 1;
end
mean(sum(rsvm)) %<- Average number of features included in each tree
>> ans =
4.6040
Question 1: Despite setting 'NumVariablesToSample' to 'all', when I extract the variables used in each tree (via Mdl.Trained{i}.CutPredictorIndex), on average only about 5 of the 21 features appear in each tree. I was expecting all 21 to be included in every tree. Why is this not the case?
In the Bagged Ensemble above, I further checked and none of the individual trees picks the categorical variable (i.e. variable 21). When I instead fit a Boosted Ensemble, the algorithm still does not pick all the variables (it only picks about 6 variables on average); however, the categorical variable (#21) is now included in a few of the individual trees.
%Boosted Ensemble
tr = templateTree('MinParentSize',250,'CategoricalPredictors',21,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','LSBoost','Learners',tr,'NumLearningCycles',Ntrees);
rsvm = false(N,Ntrees);
for i = 1:Ntrees
    idx = unique(Mdl.Trained{i}.CutPredictorIndex);
    idx(idx==0) = []; %0 marks leaf nodes
    rsvm(idx,i) = 1;
end
mean(sum(rsvm))
>> ans =
6.3160
find(rsvm(21,:)) %<- Trees that contain the categorical variable
>> ans =
Columns 1 through 18
18 31 39 50 60 67 81 151 179 181 195 204 269 298 317 319 337 394
Question 2: Despite the fact that variable 21 is included in a number of trees, the 'CategoricalPredictors' property is always empty. Can anyone explain why this is the case?
Mdl.CategoricalPredictors
>> ans = []
Mdl.Trained{18}.CategoricalPredictors
>> ans = []
Any insights are appreciated.
1 Comment
Haris K. on 13 Feb 2021
I think this question came out a little too long, so I will break it down into two separate posts. Anyone who would like to help, please see:
I will post the second half once the first one has been answered.


Answers (1)

Pratyush Roy on 15 Feb 2021
Hi Haris,
You might consider the following workarounds for your problem:
Answer to Question 1: The "CutPredictorIndex" property lists, for each node of a tree, the index of the predictor used to split the data reaching that node (0 for leaf nodes). Not every predictor will necessarily be used for a split, because at each node the algorithm simply picks the predictor that yields the best split.
However, when 'NumVariablesToSample' is set to 'all', every feature is considered at every split. You can verify this by displaying the 'PredictorNames' property of a single tree in the ensemble.
Mdl.Trained{1}.PredictorNames % Gives us the predictors considered for tree indexed 1
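For example, a quick check (a sketch based on one of the ensembles built above, reusing Mdl) contrasts the predictors a tree was allowed to consider with the predictors it actually used for splits:
nConsidered = numel(Mdl.Trained{1}.PredictorNames); %predictors available at every split
usedIdx = unique(Mdl.Trained{1}.CutPredictorIndex);
usedIdx(usedIdx==0) = [];                           %0 marks leaf nodes
nUsed = numel(usedIdx);                             %predictors actually chosen for a split
fprintf('Considered: %d, used for splits: %d\n', nConsidered, nUsed);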
Answer to Question 2: As a workaround, you can pass 'CategoricalPredictors' to the fitrensemble function instead of the templateTree function.
Mdl = fitrensemble([X D],Y,'Method','Bag','Learners',tr,'CategoricalPredictors',21,'NumLearningCycles',Ntrees);
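With the flag passed at the ensemble level, the ensemble-level property should no longer be empty. A minimal check (a sketch, reusing X, D, Y and Ntrees from the question, with 'CategoricalPredictors' removed from templateTree so it is specified only once):
tr  = templateTree('MinParentSize',250,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','Bag','Learners',tr, ...
    'CategoricalPredictors',21,'NumLearningCycles',Ntrees);
Mdl.CategoricalPredictors            %expected to return 21 instead of []
Mdl.Trained{1}.CategoricalPredictors %the flag should also propagate to each trained tree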
Hope this helps!
Regards,
Pratyush.
  2 Comments
Haris K. on 16 Feb 2021
Dear Pratyush, thank you very much for your help.
When I run a single regression tree [using Mdl = fitrtree(X,Y,'NumVariablesToSample','all')], almost 99% of the variables are used (again, as captured by Mdl.CutPredictorIndex). Why does this happen with fitrtree() but not with the individual trees in fitrensemble()? At least with bagging, which is just bootstrap aggregation of individual trees, I would expect behaviour similar to fitrtree().
Is there any way (possibly through a specific combination of hyperparameters) to force all, or at least a considerable number, of the variables into each tree?
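For reference, the single-tree check described above looks roughly like this (a sketch of the comparison; note that it uses only the continuous predictors X, as in the call above):
MdlT = fitrtree(X,Y,'NumVariablesToSample','all');
idxT = unique(MdlT.CutPredictorIndex);
idxT(idxT==0) = [];   %0 marks leaf nodes
numel(idxT)/size(X,2) %fraction of predictors used by the single tree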
Pratyush Roy on 16 Feb 2021
Hi Haris,
The function fitrtree uses the default value of the "MinParentSize" parameter, i.e., a minimum of 10 observations per branch node, whereas the code above sets it to 250. Reducing that value will allow the individual trees to grow deeper and therefore use more predictor variables.
Other parameters that can be tuned are "MaxNumSplits" and "MinLeafSize".
The documentation link for templateTree might be helpful.
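For instance, a sketch along these lines (the specific values are only illustrative) loosens the stopping rules so that each tree can keep splitting and therefore use more predictors:
tr  = templateTree('MinParentSize',10,'MinLeafSize',5,'NumVariablesToSample','all');
Mdl = fitrensemble([X D],Y,'Method','Bag','Learners',tr, ...
    'CategoricalPredictors',21,'NumLearningCycles',Ntrees);
%Re-running the rsvm loop from the question should now show a much higher
%average number of predictors used per tree.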
Hope this helps!

