
quantileError

Class: TreeBagger

Quantile loss using bag of regression trees

Description


err = quantileError(Mdl,X) returns half of the mean absolute deviation (MAD) between the true responses in the table X and the predicted medians obtained by applying the bag of regression trees Mdl to the predictor data in X.

  • Mdl must be a TreeBagger model object.

  • The response variable in X must have the same name as the response variable in the table containing the training data.
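This default behavior can be checked directly: for the default quantile probability 0.5, the quantile loss equals half the MAD from the predicted medians. A minimal sketch, assuming the carsmall data used in the examples on this page:

```matlab
% Sketch: relate quantileError's default output to half of the MAD.
load carsmall
X = table(Displacement,Weight,Cylinders,MPG);
Mdl = TreeBagger(100,X,'MPG','Method','regression');

err = quantileError(Mdl,X);            % default quantile probability is 0.5
medPred = quantilePredict(Mdl,X);      % predicted conditional medians
halfMAD = 0.5*mean(abs(MPG - medPred),'omitnan');
% err and halfMAD should agree up to floating-point error
```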


err = quantileError(Mdl,X,ResponseVarName) uses the true response and predictor variables contained in the table X. ResponseVarName is the name of the response variable, and Mdl.PredictorNames contains the names of the predictor variables.


err = quantileError(Mdl,X,Y) uses the predictor data in the table or matrix X and the response data in the vector Y.


err = quantileError(___,Name,Value) uses any of the previous syntaxes and additional options specified by one or more Name,Value pair arguments. For example, specify quantile probabilities, the error type, or which trees to include in the quantile-regression-error estimation.

Input Arguments


Mdl

Bag of regression trees, specified as a TreeBagger model object created by TreeBagger. The value of Mdl.Method must be 'regression'.

X

Sample data used to estimate quantiles, specified as a numeric matrix or table.

Each row of X corresponds to one observation, and each column corresponds to one variable. If you specify Y, then the number of rows in X must be equal to the length of Y.

  • For a numeric matrix:

    • The variables making up the columns of X must have the same order as the predictor variables that trained Mdl (stored in Mdl.PredictorNames).

    • If you trained Mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. If Tbl contains heterogeneous predictor variables (for example, numeric and categorical data types), then quantileError throws an error.

    • Specify Y for the true responses.

  • For a table:

    • quantileError does not support multicolumn variables or cell arrays other than cell arrays of character vectors.

    • If you trained Mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those variables that trained Mdl (stored in Mdl.PredictorNames). However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, etc.).

    • If you trained Mdl using a numeric matrix, then the predictor names in Mdl.PredictorNames and corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of TreeBagger. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, etc.).

    • If X contains the response variable:

      • If the response variable has the same name as the response variable that trained Mdl, then you do not have to supply the response variable name or vector of true responses. quantileError uses that variable for the true responses by default.

      • You can specify ResponseVarName or Y for the true responses.

Data Types: table | double | single

ResponseVarName

Response variable name, specified as a character vector or string scalar. ResponseVarName must be the name of the response variable in the table of sample data X.

If the table X contains the response variable, and it has the same name as the response variable used to train Mdl, then you do not have to specify ResponseVarName. quantileError uses that variable for the true responses by default.

Data Types: char | string

Y

True responses, specified as a numeric vector. The number of rows in X must be equal to the length of Y.

Data Types: double | single

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Mode

Ensemble error type, specified as the comma-separated pair consisting of 'Mode' and one of the following values. Suppose tau is the value of Quantile.

  • 'cumulative': err is a Mdl.NumTrees-by-numel(tau) numeric matrix of cumulative quantile regression errors. err(j,k) is the tau(k) quantile regression error using the learners in Mdl.Trees(1:j) only.

  • 'ensemble': err is a 1-by-numel(tau) numeric vector of cumulative quantile regression errors for the entire ensemble. err(k) is the tau(k) ensemble quantile regression error.

  • 'individual': err is a Mdl.NumTrees-by-numel(tau) numeric matrix of quantile regression errors from individual learners. err(j,k) is the tau(k) quantile regression error using the learner in Mdl.Trees(j) only.

For 'cumulative' and 'individual', if you include fewer trees in quantile estimation using Trees or UseInstanceForTree, then the number of rows in err decreases from Mdl.NumTrees.

Example: 'Mode','cumulative'

Weights

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a numeric vector of positive values with length equal to size(X,1). quantileError uses Weights to compute the weighted average of the deviations when estimating the quantile regression error.

By default, quantileError attributes a weight of 1 to each observation, which yields an unweighted average of the deviations.
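For example, a hypothetical weighting that halves the influence of the last ten observations (a sketch, assuming Mdl and the table X from the examples on this page):

```matlab
% Sketch: weighted quantile regression error with hypothetical weights.
w = ones(size(X,1),1);
w(end-9:end) = 0.5;    % hypothetical: down-weight the last 10 observations
errW = quantileError(Mdl,X,'Weights',w);
```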

Quantile

Quantile probabilities, specified as the comma-separated pair consisting of 'Quantile' and a numeric vector containing values in the interval [0,1]. quantileError returns a quantile regression error for each probability in Quantile.

Example: 'Quantile',[0 0.25 0.5 0.75 1]

Data Types: single | double

Trees

Indices of trees to use in response estimation, specified as the comma-separated pair consisting of 'Trees' and 'all' or a numeric vector of positive integers. Indices correspond to the cells of Mdl.Trees; each cell therein contains a tree in the ensemble. The maximum value of Trees must be less than or equal to the number of trees in the ensemble (Mdl.NumTrees).

For 'all', quantileError uses all trees in the ensemble (that is, the indices 1:Mdl.NumTrees).

Values other than the default can affect the number of rows in err.

Example: 'Trees',[1 10 Mdl.NumTrees]

Data Types: char | string | single | double

TreeWeights

Weights to attribute to responses from individual trees, specified as the comma-separated pair consisting of 'TreeWeights' and a numeric vector of numel(trees) nonnegative values. trees is the value of Trees.

If you specify 'Mode','individual', then quantileError ignores TreeWeights.

Data Types: single | double
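For example, a hypothetical weighting that gives later trees more influence (a sketch, assuming Mdl and X from the examples on this page and the default value of Trees):

```matlab
% Sketch: weight the responses from individual trees.
tw = linspace(0.5,1,Mdl.NumTrees);   % hypothetical increasing tree weights
errTW = quantileError(Mdl,X,'TreeWeights',tw);
```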

UseInstanceForTree

Indicators specifying which trees to use to make predictions for each observation, specified as the comma-separated pair consisting of 'UseInstanceForTree' and 'all' or an n-by-Mdl.NumTrees logical matrix, where n is the number of observations (rows) in X. Rows of UseInstanceForTree correspond to observations, and columns correspond to the learners in Mdl.Trees. The default, 'all', indicates to use all trees for all observations when estimating the quantiles.

If UseInstanceForTree(j,k) = true, then quantileError uses the tree in Mdl.Trees(k) when it predicts the response for the observation X(j,:).

You can estimate quantiles using the response data in Mdl.Y directly instead of using the predictions from the random forest by specifying a row composed entirely of false values. For example, to estimate the quantile for observation j using the response data, and to use the predictions from the random forest for all other observations, specify this matrix:

UseInstanceForTree = true(size(Mdl.X,1),Mdl.NumTrees);
UseInstanceForTree(j,:) = false(1,Mdl.NumTrees);

Values other than the default can affect the number of rows in err. Also, the value of Trees affects the value of UseInstanceForTree. Suppose that U is the value of UseInstanceForTree. quantileError ignores the columns of U corresponding to trees not being used in estimation from the specification of Trees. That is, quantileError resets the value of 'UseInstanceForTree' to U(:,trees), where trees is the value of 'Trees'.

Data Types: char | string | logical
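A usage sketch that applies such a matrix, assuming Mdl and X from the examples on this page:

```matlab
% Sketch: use the response data in Mdl.Y for observation 1 only,
% and the random forest predictions for all other observations.
n = size(X,1);
U = true(n,Mdl.NumTrees);
U(1,:) = false;
errU = quantileError(Mdl,X,'UseInstanceForTree',U);
```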

Output Arguments


err

Half of the quantile regression error, returned as a numeric scalar or T-by-numel(tau) matrix. tau is the value of Quantile.

T depends on the values of Mode, Trees, UseInstanceForTree, and Quantile. Suppose that you specify 'Trees',trees and you use the default value of 'UseInstanceForTree'.

  • For 'Mode','cumulative', err is a numel(trees)-by-numel(tau) numeric matrix. err(j,k) is the tau(k) cumulative quantile regression error using the learners in Mdl.Trees(trees(1:j)).

  • For 'Mode','ensemble', err is a 1-by-numel(tau) numeric vector. err(k) is the tau(k) cumulative quantile regression error using the learners in Mdl.Trees(trees).

  • For 'Mode','individual', err is a numel(trees)-by-numel(tau) numeric matrix. err(j,k) is the tau(k) quantile regression error using the learner in Mdl.Trees(trees(j)).
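A sketch that exercises each mode, assuming Mdl and X from the examples on this page and the default values of 'Trees' and 'UseInstanceForTree':

```matlab
% Sketch: compare the output shapes for the three values of 'Mode'.
tau = [0.25 0.5 0.75];
errC = quantileError(Mdl,X,'Quantile',tau,'Mode','cumulative');  % Mdl.NumTrees-by-3
errE = quantileError(Mdl,X,'Quantile',tau,'Mode','ensemble');    % 1-by-3
errI = quantileError(Mdl,X,'Quantile',tau,'Mode','individual');  % Mdl.NumTrees-by-3
```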

Examples


Load the carsmall data set. Consider a model that predicts the fuel economy of a car given its engine displacement, weight, and number of cylinders. Consider Cylinders a categorical variable.

load carsmall
Cylinders = categorical(Cylinders);
X = table(Displacement,Weight,Cylinders,MPG);

Train an ensemble of bagged regression trees using the entire data set. Specify 100 weak learners.

rng(1); % For reproducibility
Mdl = TreeBagger(100,X,'MPG','Method','regression');

Mdl is a TreeBagger ensemble.

Perform quantile regression, and estimate the MAD of the entire ensemble using the predicted conditional medians.

err = quantileError(Mdl,X)
err = 1.2339

Because X is a table that contains the response variable with the same name as the response used to train Mdl, you do not have to specify the response variable name or data. However, you can specify the response explicitly using this syntax.

err = quantileError(Mdl,X,'MPG')
err = 1.2339

Load the carsmall data set. Consider a model that predicts the fuel economy of a car given its engine displacement, weight, and number of cylinders.

load carsmall
X = table(Displacement,Weight,Cylinders,MPG);

Randomly split the data into two sets: 75% training and 25% testing. Extract the subset indices.

rng(1); % For reproducibility 
cvp = cvpartition(size(X,1),'Holdout',0.25);
idxTrn = training(cvp);
idxTest = test(cvp);

Train an ensemble of bagged regression trees using the training set. Specify 250 weak learners.

Mdl = TreeBagger(250,X(idxTrn,:),'MPG','Method','regression');

Estimate the cumulative 0.25, 0.5, and 0.75 quantile regression errors for the test set. Pass the predictor data in as a numeric matrix, and the response data in as a vector.

err = quantileError(Mdl,X{idxTest,1:3},MPG(idxTest),'Quantile',[0.25 0.5 0.75],...
    'Mode','cumulative');

err is a 250-by-3 matrix of cumulative quantile regression errors. Columns correspond to quantile probabilities, and rows correspond to trees in the ensemble. The errors are cumulative, so they incorporate aggregated predictions from previous trees. Although Mdl was trained using a table, you can supply a matrix of predictor data instead because all predictor variables in the table are numeric.

Plot the cumulative quantile errors on the same plot.

figure;
plot(err);
legend('0.25 quantile error','0.5 quantile error','0.75 quantile error');
ylabel('Quantile error');
xlabel('Tree index');
title('Cumulative Quantile Regression Error')

Training using about 60 trees appears to be enough for the first two quartiles, but the third quartile requires about 150 trees.

More About


Tips

  • To tune the number of trees in the ensemble, set 'Mode','cumulative' and plot the quantile regression errors against the tree indices. The required number of trees is the tree index at which the quantile regression error levels off.

  • To investigate the performance of a model when the training sample is small, use oobQuantileError instead.
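The first tip can be automated: a sketch that picks the smallest tree index whose cumulative error is within a tolerance of the final ensemble error (the 1% tolerance is a hypothetical choice), assuming Mdl and X from the examples on this page:

```matlab
% Sketch: choose the number of trees from the cumulative error curve.
errCum = quantileError(Mdl,X,'Mode','cumulative');
tol = 0.01*errCum(end);                        % hypothetical 1% tolerance
numTrees = find(abs(errCum - errCum(end)) <= tol,1);
```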

References

[1] Breiman, L. “Random Forests.” Machine Learning, Vol. 45, 2001, pp. 5–32.

[2] Meinshausen, N. “Quantile Regression Forests.” Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.

Introduced in R2016b