quantileError
Quantile loss using bag of regression trees
Syntax
Description
err = quantileError(Mdl,X) returns half of the mean absolute deviation (MAD) from comparing the true responses in the table X to the predicted medians resulting from applying the bag of regression trees Mdl to the observations of the predictor data in X.
Mdl must be a TreeBagger model object. The response variable name in X must have the same name as the response variable in the table containing the training data.
err = quantileError(Mdl,X,ResponseVarName) uses the true response and predictor variables contained in the table X. ResponseVarName is the name of the response variable and Mdl.PredictorNames contains the names of the predictor variables.
err = quantileError(___,Name,Value) uses any of the previous syntaxes and additional options specified by one or more Name,Value pair arguments. For example, specify quantile probabilities, the error type, or which trees to include in the quantile-regression-error estimation.
Input Arguments
Mdl
— Bag of regression trees
TreeBagger
model object (default)
Bag of regression trees, specified as a TreeBagger
model object created by the TreeBagger
function. The value of Mdl.Method
must be
regression
.
X
— Sample data
numeric matrix | table
Sample data used to estimate quantiles, specified as a numeric matrix or table.
Each row of X
corresponds to one observation,
and each column corresponds to one variable. If you specify Y
,
then the number of rows in X
must be equal to the
length of Y
.
For a numeric matrix:
- The variables making up the columns of X must have the same order as the predictor variables that trained Mdl (stored in Mdl.PredictorNames).
- If you trained Mdl using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. If Tbl contains heterogeneous predictor variables (for example, numeric and categorical data types), then quantileError throws an error.
- Specify Y for the true responses.
For a table:
- quantileError does not support multicolumn variables or cell arrays other than cell arrays of character vectors.
- If you trained Mdl using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those variables that trained Mdl (stored in Mdl.PredictorNames). However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, etc.).
- If you trained Mdl using a numeric matrix, then the predictor names in Mdl.PredictorNames and corresponding predictor variable names in X must be the same. To specify predictor names during training, see the PredictorNames name-value pair argument of the TreeBagger function. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, etc.).
- If X contains the response variable:
  - If the response variable has the same name as the response variable that trained Mdl, then you do not have to supply the response variable name or vector of true responses. quantileError uses that variable for the true responses by default.
  - You can specify ResponseVarName or Y for the true responses.
Data Types: table
| double
| single
ResponseVarName
— Response variable name
character vector | string scalar
Response variable name, specified as a character vector or string scalar.
ResponseVarName
must be the name of the response
variable in the table of sample data X
.
If the table X
contains the response variable,
and it has the same name as the response variable used to train Mdl
,
then you do not have to specify ResponseVarName
. quantileError
uses
that variable for the true responses by default.
Data Types: char
| string
Y
— True responses
numeric vector
True responses, specified as a numeric vector. The number of rows in X
must
be equal to the length of Y
.
Data Types: double
| single
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN
, where Name
is
the argument name and Value
is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose
Name
in quotes.
Mode
— Ensemble error type
'ensemble'
(default) | 'cumulative'
| 'individual'
Ensemble error type, specified as the comma-separated pair consisting
of 'Mode'
and a value in this table. Suppose tau
is
the value of Quantile
.
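Value | Description |
---|---|
'cumulative' | err is a Mdl.NumTrees-by-numel(tau) numeric matrix. err(j,k) is the tau(k) cumulative quantile regression error using the learners in Mdl.Trees(1:j). |
'ensemble' | err is a 1-by-numel(tau) numeric vector. err(k) is the tau(k) cumulative quantile regression error using the entire ensemble. |
'individual' | err is a Mdl.NumTrees-by-numel(tau) numeric matrix. err(j,k) is the tau(k) quantile regression error using the learner in Mdl.Trees(j). |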
For 'cumulative'
and 'individual'
,
if you include fewer trees in quantile estimation using Trees
or UseInstanceForTree
,
then the number of rows in err
decreases from Mdl.NumTrees
.
Example: 'Mode','cumulative'
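The following sketch compares the shape of err across the three modes. The ensemble size of 50, the choice of predictors, and the variable names Tbl, errEns, errCum, and errInd are illustrative assumptions, not values prescribed by this page.

```matlab
load carsmall
Tbl = table(Displacement,Weight,MPG);   % illustrative choice of predictors
rng(1); % For reproducibility
Mdl = TreeBagger(50,Tbl,'MPG','Method','regression');

tau = [0.25 0.5 0.75];
% 'ensemble' is the default mode
errEns = quantileError(Mdl,Tbl,'MPG','Quantile',tau);
errCum = quantileError(Mdl,Tbl,'MPG','Quantile',tau,'Mode','cumulative');
errInd = quantileError(Mdl,Tbl,'MPG','Quantile',tau,'Mode','individual');

size(errEns)   % one row, one column per quantile probability
size(errCum)   % one row per tree, one column per quantile probability
size(errInd)   % one row per tree, one column per quantile probability
```

Passing the response name 'MPG' explicitly keeps the call unambiguous when name-value arguments follow.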
Weights
— Observation weights
ones(size(X,1),1)
(default) | numeric vector of positive values
Observation weights, specified as the comma-separated pair consisting
of 'Weights'
and a numeric vector of positive values
with length equal to size(X,1)
. quantileError
uses Weights
to
compute the weighted average of the deviations when estimating the
quantile regression error.
By default, quantileError
attributes a weight
of 1
to each observation, which yields an unweighted
average of the deviations.
Quantile
— Quantile probability
0.5
(default) | numeric vector containing values in [0,1]
Quantile probability, specified as the comma-separated pair consisting of 'Quantile' and a numeric vector containing values in the interval [0,1]. quantileError returns one quantile regression error for each probability in Quantile.
Example: 'Quantile',[0 0.25 0.5 0.75 1]
Data Types: single
| double
Trees
— Indices of trees to use in response estimation
'all'
(default) | numeric vector of positive integers
Indices of trees to use in response estimation, specified as
the comma-separated pair consisting of 'Trees'
and 'all'
or
a numeric vector of positive integers. Indices correspond to the cells
of Mdl.Trees
; each cell therein contains a tree
in the ensemble. The maximum value of Trees
must
be less than or equal to the number of trees in the ensemble (Mdl.NumTrees
).
For 'all'
, quantileError
uses
all trees in the ensemble (that is, the indices 1:Mdl.NumTrees
).
Values other than the default can affect the number of rows
in err
.
Example: 'Trees',[1 10 Mdl.NumTrees]
Data Types: char
| string
| single
| double
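As a rough illustration of how Trees changes the number of rows in err, the sketch below evaluates three individual trees. The ensemble size, the predictors, and the names Tbl, trees, and errInd are assumptions made for the example.

```matlab
load carsmall
Tbl = table(Displacement,Weight,MPG);   % illustrative choice of predictors
rng(1); % For reproducibility
Mdl = TreeBagger(50,Tbl,'MPG','Method','regression');

% Evaluate only the first, tenth, and last trees, one error per tree
trees = [1 10 Mdl.NumTrees];
errInd = quantileError(Mdl,Tbl,'MPG','Trees',trees,'Mode','individual');

size(errInd)   % numel(trees) rows, one column for the default Quantile of 0.5
```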
TreeWeights
— Weights to attribute to responses from individual trees
ones(Mdl.NumTrees,1)
(default) | numeric vector of nonnegative values
Weights to attribute to responses from individual trees, specified as the comma-separated pair consisting of 'TreeWeights' and a numeric vector of numel(trees) nonnegative values. trees is the value of Trees.
If you specify 'Mode','individual'
, then quantileError
ignores TreeWeights
.
Data Types: single
| double
UseInstanceForTree
— Indicators specifying which trees to use to make predictions for each observation
'all'
(default) | logical matrix
Indicators specifying which trees to use to make predictions
for each observation, specified as the comma-separated pair consisting
of 'UseInstanceForTree'
and an n-by-Mdl.Trees
logical
matrix. n is the number of observations (rows)
in X
. Rows of UseInstanceForTree
correspond
to observations and columns correspond to learners in Mdl.Trees
. 'all'
indicates
to use all trees for all observations when estimating the quantiles.
If UseInstanceForTree(j,k) = true, then quantileError uses the tree in Mdl.Trees(k) when it predicts the response for the observation X(j,:).
You can estimate quantiles using the response data in Mdl.Y
directly
instead of using the predictions from the random forest by specifying
a row composed entirely of false
values. For example,
to estimate the quantile for observation j
using
the response data, and to use the predictions from the random forest
for all other observations, specify this matrix:
UseInstanceForTree = true(size(X,1),Mdl.NumTrees);
UseInstanceForTree(j,:) = false(1,Mdl.NumTrees);
Values other than the default can affect the number of rows
in err
. Also, the value of Trees
affects
the value of UseInstanceForTree
. Suppose that U
is
the value of UseInstanceForTree
. quantileError
ignores
the columns of U
corresponding to trees
not being used in estimation from the specification of Trees
.
That is, quantileError resets the value of 'UseInstanceForTree' to U(:,trees), where trees is the value of 'Trees'.
Data Types: char
| string
| logical
Output Arguments
err
— Half of quantile regression error
numeric scalar | numeric matrix
Half of the quantile regression error, returned as a numeric scalar or T-by-numel(tau) matrix. tau is the value of Quantile.
T depends on the values of Mode, Trees, UseInstanceForTree, and Quantile. Suppose that you specify 'Trees',trees and you use the default value of 'UseInstanceForTree'.
- For 'Mode','cumulative', err is a numel(trees)-by-numel(tau) numeric matrix. err(j,k) is the tau(k) cumulative quantile regression error using the learners in Mdl.Trees(trees(1:j)).
- For 'Mode','ensemble', err is a 1-by-numel(tau) numeric vector. err(k) is the tau(k) cumulative quantile regression error using the learners in Mdl.Trees(trees).
- For 'Mode','individual', err is a numel(trees)-by-numel(tau) numeric matrix. err(j,k) is the tau(k) quantile regression error using the learner in Mdl.Trees(trees(j)).
Examples
Estimate In-Sample Quantile Regression Error
Load the carsmall
data set. Consider a model that predicts the fuel economy of a car given its engine displacement, weight, and number of cylinders. Consider Cylinders
a categorical variable.
load carsmall
Cylinders = categorical(Cylinders);
X = table(Displacement,Weight,Cylinders,MPG);
Train an ensemble of bagged regression trees using the entire data set. Specify 100 weak learners.
rng(1); % For reproducibility
Mdl = TreeBagger(100,X,'MPG','Method','regression');
Mdl
is a TreeBagger
ensemble.
Perform quantile regression, and estimate the MAD of the entire ensemble using the predicted conditional medians.
err = quantileError(Mdl,X)
err = 1.2339
Because X
is a table containing the response and commensurate variable names, you do not have to specify the response variable name or data. However, you can specify the response using this syntax.
err = quantileError(Mdl,X,'MPG')
err = 1.2339
Find Appropriate Ensemble Size Using Quantile Regression Error
Load the carsmall
data set. Consider a model that predicts the fuel economy of a car given its engine displacement, weight, and number of cylinders.
load carsmall
X = table(Displacement,Weight,Cylinders,MPG);
Randomly split the data into two sets: 75% training and 25% testing. Extract the subset indices.
rng(1); % For reproducibility
cvp = cvpartition(size(X,1),'Holdout',0.25);
idxTrn = training(cvp);
idxTest = test(cvp);
Train an ensemble of bagged regression trees using the training set. Specify 250 weak learners.
Mdl = TreeBagger(250,X(idxTrn,:),'MPG','Method','regression');
Estimate the cumulative 0.25, 0.5, and 0.75 quantile regression errors for the test set. Pass the predictor data in as a numeric matrix, and the response data in as a vector.
err = quantileError(Mdl,X{idxTest,1:3},MPG(idxTest),'Quantile',[0.25 0.5 0.75],...
    'Mode','cumulative');
err is a 250-by-3 matrix of cumulative quantile regression errors. Columns correspond to quantile probabilities and rows correspond to trees in the ensemble. The errors are cumulative, so they incorporate aggregated predictions from previous trees. Although Mdl was trained using a table, you can supply a matrix of predictor data instead because all predictor variables in the table are numeric.
Plot the cumulative quantile errors on the same plot.
figure;
plot(err);
legend('0.25 quantile error','0.5 quantile error','0.75 quantile error');
ylabel('Quantile error');
xlabel('Tree index');
title('Cumulative Quantile Regression Error')
Training using about 60 trees appears to be enough for the first two quartiles, but the third quartile requires about 150 trees.
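Reading the leveling-off point from a plot can also be approximated numerically. The sketch below is one possible heuristic, not a toolbox feature: it uses a synthetic, monotonically decreasing curve errCol as a stand-in for one column of a cumulative error matrix, and the 1% tolerance tol is a hypothetical choice.

```matlab
% Synthetic stand-in for one column of a cumulative error matrix
nTrees = 250;
errCol = 1 + 2*exp(-(1:nTrees)'/20);

tol = 0.01;   % hypothetical tolerance: within 1% of the final error
withinTol = abs(errCol - errCol(end)) <= tol*errCol(end);

% Smallest tree index from which the curve stays within the tolerance
lastOutside = find(~withinTol,1,'last');
if isempty(lastOutside)
    nEnough = 1;
else
    nEnough = lastOutside + 1;
end
```

With real errors, which are noisier than this synthetic curve, the result depends strongly on the tolerance, so a visual check of the plot remains useful.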
More About
Quantile Regression Error
The quantile regression error of a model given observed predictor data and responses is the weighted mean absolute deviation (MAD). If the model under-predicts the response, then deviation weights are τ, the quantile probability. If the model over-predicts, then deviation weights are 1 – τ.
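That is, the τ quantile regression error is
$$L_\tau = \tau \sum_{j:\, y_j \ge \hat{y}_{\tau,j}} w_j \left(y_j - \hat{y}_{\tau,j}\right) + (1-\tau) \sum_{j:\, y_j < \hat{y}_{\tau,j}} w_j \left(\hat{y}_{\tau,j} - y_j\right)$$
where $y_j$ is true response j, $\hat{y}_{\tau,j}$ is the τ quantile that the model predicts, and $w_j$ is observation weight j.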
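The weighting rule described above can be sketched in base MATLAB. The name pinballLoss is a hypothetical helper, not a toolbox function, and the sketch assumes that the observation weights are normalized to sum to 1 so that the result is a weighted mean.

```matlab
% Hypothetical helper (not a toolbox function): tau quantile loss of the
% predictions yhat for true responses y with observation weights w.
% Assumption: the weights are normalized to sum to 1.
pinballLoss = @(y,yhat,w,tau) sum((w(:)/sum(w(:))) .* ...
    (tau*max(y(:) - yhat(:),0) + (1 - tau)*max(yhat(:) - y(:),0)));

y    = [10; 20; 30; 40];   % true responses
yhat = [12; 18; 30; 44];   % predicted medians
w    = ones(4,1);          % unweighted case

errMedian = pinballLoss(y,yhat,w,0.5);   % 1
halfMAD   = 0.5*mean(abs(y - yhat));     % 1, half of the MAD
```

For τ = 0.5 both kinds of deviation receive weight 0.5, which is why the median case reduces to half of the MAD.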
Tips
- To tune the number of trees in the ensemble, set 'Mode','cumulative' and plot the quantile regression errors with respect to tree indices. The maximal number of required trees is the tree index where the quantile regression error appears to level off.
- To investigate the performance of a model when the training sample is small, use oobQuantileError instead.
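A minimal sketch of the second tip follows, assuming the carsmall data and an ensemble of 100 trees. The names Tbl and oobErr are illustrative, and the model must be trained with 'OOBPrediction','on' so that out-of-bag information is stored.

```matlab
load carsmall
Tbl = table(Displacement,Weight,MPG);   % illustrative choice of predictors
rng(1); % For reproducibility
% Store out-of-bag information during training
Mdl = TreeBagger(100,Tbl,'MPG','Method','regression','OOBPrediction','on');

% Cumulative out-of-bag quantile regression error for the median
oobErr = oobQuantileError(Mdl,'Mode','cumulative');

figure
plot(oobErr)
xlabel('Tree index')
ylabel('Out-of-bag quantile error')
```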
References
[1] Breiman, L. “Random Forests.” Machine Learning, Vol. 45, 2001, pp. 5–32.
[2] Meinshausen, N. “Quantile Regression Forests.” Journal of Machine Learning Research, Vol. 7, 2006, pp. 983–999.
Version History
Introduced in R2016b
See Also