Model Building and Assessment

Feature selection, feature engineering, model selection, hyperparameter optimization, cross-validation, residual diagnostics, and plots

When you build a high-quality regression model, it is important to select the right features (or predictors), tune hyperparameters (model parameters not fit to the data), and assess model assumptions through residual diagnostics.

You can tune hyperparameters by iterating between choosing values for them and cross-validating a model using your choices. This process yields multiple models, and the best model among them might be the one that minimizes the estimated generalization error. For example, to tune an SVM model, choose a set of box constraints and kernel scales, cross-validate a model for each pair of values, and then compare their 10-fold, cross-validated, mean squared error estimates.
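As a minimal sketch of the SVM example above, the following loop cross-validates one model per pair of box constraint and kernel scale and compares the 10-fold mean squared error estimates. The `carsmall` sample data, the chosen predictors, and the candidate grids are assumptions made for illustration only.

```matlab
% Sketch: grid search over SVM box constraints and kernel scales,
% scored by 10-fold cross-validated mean squared error (MSE).
rng(0)                                   % reproducible fold assignment
load carsmall                            % sample data set (assumed here)
X = [Horsepower Weight];
y = MPG;
keep = ~any(isnan([X y]), 2);            % drop rows with missing values
X = X(keep,:);
y = y(keep);

boxConstraints = [0.1 1 10];             % illustrative candidate values
kernelScales   = [0.5 1 5];              % illustrative candidate values
cvp = cvpartition(numel(y), 'KFold', 10);  % same folds for every candidate

cvMSE = zeros(numel(boxConstraints), numel(kernelScales));
for i = 1:numel(boxConstraints)
    for j = 1:numel(kernelScales)
        cvMdl = fitrsvm(X, y, 'KernelFunction', 'gaussian', ...
            'BoxConstraint', boxConstraints(i), ...
            'KernelScale', kernelScales(j), ...
            'Standardize', true, 'CVPartition', cvp);
        cvMSE(i,j) = kfoldLoss(cvMdl);   % default loss is MSE
    end
end

% Pick the pair with the smallest estimated generalization error.
[bestMSE, idx] = min(cvMSE(:));
[iBest, jBest] = ind2sub(size(cvMSE), idx);
fprintf('Best box constraint = %g, kernel scale = %g, CV MSE = %.3f\n', ...
    boxConstraints(iBest), kernelScales(jBest), bestMSE);
```

Reusing one `cvpartition` object for every candidate keeps the comparison fair, because each model is scored on identical folds.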

To engineer new features before training a regression model, use genrfeatures.
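A short sketch of this workflow follows. It assumes the `patients` sample data, a 25% holdout split, and a request for 10 generated features; all of these are illustrative choices rather than recommendations.

```matlab
% Sketch: generate features on training data, then apply the same
% transformations to held-out data.
rng("default")
load patients                            % sample data set (assumed here)
Tbl = table(Age, Diastolic, Height, Weight, Smoker, Systolic);

cvp = cvpartition(height(Tbl), 'HoldOut', 0.25);
trainTbl = Tbl(training(cvp), :);
testTbl  = Tbl(test(cvp), :);

% Ask for 10 features for predicting the response variable Systolic.
[T, newTrainTbl] = genrfeatures(trainTbl, "Systolic", 10);

info = describe(T);                      % how each feature was generated
disp(info)

% Apply the learned transformations to the held-out observations.
newTestTbl = transform(T, testTbl);
```

The transformed tables can then be passed to a fitting function in place of the original ones.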

To build and assess regression models interactively, use the Regression Learner app.

To automatically select a model with tuned hyperparameters, use fitrauto. The function tries a selection of regression model types with different hyperparameter values and returns a final model that is expected to perform well. Use fitrauto when you are uncertain which regression model types best suit your data.
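The call below is a hedged sketch of such a search. The `carsmall` data, the restricted learner list, and the budget of 20 objective evaluations are assumptions chosen to keep the run short.

```matlab
% Sketch: let fitrauto search over several model types and hyperparameters.
rng("default")
load carsmall                            % sample data set (assumed here)
Tbl = rmmissing(table(Horsepower, Weight, Displacement, MPG));

% Small search budget and no plots or console output, for illustration.
opts = struct('MaxObjectiveEvaluations', 20, ...
              'ShowPlots', false, 'Verbose', 0);

[mdl, results] = fitrauto(Tbl, 'MPG', ...
    'Learners', {'tree', 'svm', 'ensemble'}, ...
    'HyperparameterOptimizationOptions', opts);

disp(class(mdl))                         % type of the selected model
yhat = predict(mdl, Tbl(1:5,:));         % predictions from the final model
```

The second output, `results`, holds the optimization record, which you can inspect to see which model types and settings were tried.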

Certain nonparametric regression functions in Statistics and Machine Learning Toolbox™ offer automatic hyperparameter tuning through Bayesian optimization, grid search, or random search. bayesopt, the main function for implementing Bayesian optimization, is flexible enough for many other applications as well. For more details, see Bayesian Optimization Workflow.
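As an illustration of calling `bayesopt` directly, the sketch below tunes the box constraint and kernel scale of an SVM regression model by minimizing 10-fold cross-validation loss. The variable names `box` and `scale`, their search ranges, the `carsmall` data, and the evaluation budget are assumptions made for this example.

```matlab
% Sketch: Bayesian optimization of two SVM hyperparameters with bayesopt.
rng("default")
load carsmall                            % sample data set (assumed here)
X = [Horsepower Weight];
y = MPG;
keep = ~any(isnan([X y]), 2);            % drop rows with missing values
X = X(keep,:);
y = y(keep);
cvp = cvpartition(numel(y), 'KFold', 10);

% Search both hyperparameters on a log scale (ranges are illustrative).
box   = optimizableVariable('box',   [1e-3 1e3], 'Transform', 'log');
scale = optimizableVariable('scale', [1e-3 1e3], 'Transform', 'log');

% Objective: cross-validated MSE for one candidate point p (a 1-row table).
objFcn = @(p) kfoldLoss(fitrsvm(X, y, 'KernelFunction', 'gaussian', ...
    'BoxConstraint', p.box, 'KernelScale', p.scale, ...
    'Standardize', true, 'CVPartition', cvp));

results = bayesopt(objFcn, [box scale], ...
    'MaxObjectiveEvaluations', 15, ...
    'AcquisitionFunctionName', 'expected-improvement-plus', ...
    'Verbose', 0, 'PlotFcn', []);

best = bestPoint(results);               % table with the chosen box and scale
disp(best)
```

Many fitting functions accept an `'OptimizeHyperparameters'` name-value argument that wraps a similar search, so calling `bayesopt` yourself is mainly useful when you need a custom objective.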

To interpret a regression model, you can use lime, shapley, and plotPartialDependence.
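The following sketch applies all three tools to one model. The regression tree, the `carsmall` data, the choice of the first observation as the query point, and the predictor plotted are assumptions for illustration.

```matlab
% Sketch: explain a regression tree with Shapley values, LIME, and a PDP.
rng("default")
load carsmall                            % sample data set (assumed here)
X = [Horsepower Weight Displacement];
y = MPG;
keep = ~any(isnan([X y]), 2);            % drop rows with missing values
X = X(keep,:);
y = y(keep);

mdl = fitrtree(X, y, ...
    'PredictorNames', {'Horsepower', 'Weight', 'Displacement'});
q = X(1,:);                              % query point to explain

% Shapley values: contribution of each predictor to the prediction at q.
explainer = shapley(mdl, 'QueryPoint', q);
disp(explainer.ShapleyValues)

% LIME: fit a simple local model using the two most important predictors.
limeResults = lime(mdl, 'QueryPoint', q, 'NumImportantPredictors', 2);
plot(limeResults)

% Partial dependence of the predicted response on Weight.
figure
plotPartialDependence(mdl, 'Weight')
```

Shapley values and LIME explain a single prediction, while the partial dependence plot summarizes the effect of a predictor across the data.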

Apps

Regression Learner

Train regression models to predict data using supervised machine learning

Functions

Feature Selection

`fsrftest`	Univariate feature ranking for regression using F-tests (Since R2020a)
`fsrmrmr`	Rank features for regression using minimum redundancy maximum relevance (MRMR) algorithm (Since R2022a)
`fsrnca`	Feature selection using neighborhood component analysis for regression
`oobPermutedPredictorImportance`	Out-of-bag predictor importance estimates for random forest of regression trees by permutation
`partialDependence`	Compute partial dependence (Since R2020b)
`permutationImportance`	Predictor importance by permutation (Since R2024a)
`plotPartialDependence`	Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots
`predictorImportance`	Estimates of predictor importance for regression tree
`predictorImportance`	Estimates of predictor importance for regression ensemble of decision trees
`relieff`	Rank importance of predictors using ReliefF or RReliefF algorithm
`selectFeatures`	Select important features for NCA classification or regression (Since R2023b)
`sequentialfs`	Sequential feature selection using custom criterion
`stepwiselm`	Perform stepwise regression
`stepwiseglm`	Create generalized linear regression model by stepwise regression

Feature Engineering

`genrfeatures`	Perform automated feature engineering for regression (Since R2021b)
`describe`	Describe generated features (Since R2021a)
`transform`	Transform new data using generated features (Since R2021a)

Automated Model Selection

`fitrauto`	Automatically select regression model with optimized hyperparameters (Since R2020b)

Hyperparameter Optimization

`bayesopt`	Select optimal machine learning hyperparameters using Bayesian optimization
`hyperparameters`	Variable descriptions for optimizing a fit function
`optimizableVariable`	Variable description for `bayesopt` or other optimizers

Cross-Validation

For Time-Independent Data

`crossval`	Estimate loss using cross-validation
`cvpartition`	Partition data for cross-validation
`repartition`	Repartition data for cross-validation
`test`	Test indices for cross-validation
`training`	Training indices for cross-validation

For Time Series Data

`tspartition`	Partition time series data for cross-validation (Since R2022b)
`test`	Test indices for time series cross-validation (Since R2022b)
`training`	Training indices for time series cross-validation (Since R2022b)

Model Interpretation

Local Interpretable Model-Agnostic Explanations (LIME)

`lime`	Local interpretable model-agnostic explanations (LIME) (Since R2020b)
`fit`	Fit simple model of local interpretable model-agnostic explanations (LIME) (Since R2020b)
`plot`	Plot results of local interpretable model-agnostic explanations (LIME) (Since R2020b)

Shapley Values

`shapley`	Shapley values (Since R2021a)
`fit`	Compute Shapley values for query points (Since R2021a)
`plot`	Plot Shapley values using bar graphs (Since R2021a)
`boxchart`	Visualize Shapley values using box charts (box plots) (Since R2024a)
`swarmchart`	Visualize Shapley values using swarm scatter charts (Since R2024a)

Partial Dependence

`partialDependence`	Compute partial dependence (Since R2020b)
`plotPartialDependence`	Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots

Linear Model Diagnostics

`coefCI`	Confidence intervals of coefficient estimates of linear regression model
`coefTest`	Linear hypothesis test on linear regression model coefficients
`dwtest`	Durbin-Watson test with linear regression model object
`plot`	Scatter plot or added variable plot of linear regression model
`plotAdded`	Added variable plot of linear regression model
`plotAdjustedResponse`	Adjusted response plot of linear regression model
`plotDiagnostics`	Plot observation diagnostics of linear regression model
`plotEffects`	Plot main effects of predictors in linear regression model
`plotInteraction`	Plot interaction effects of two predictors in linear regression model
`plotResiduals`	Plot residuals of linear regression model
`plotSlice`	Plot of slices through fitted linear regression surface

Generalized Linear Model Diagnostics

`coefCI`	Confidence intervals of coefficient estimates of generalized linear regression model
`coefTest`	Linear hypothesis test on generalized linear regression model coefficients
`devianceTest`	Analysis of deviance for generalized linear regression model
`plotDiagnostics`	Plot observation diagnostics of generalized linear regression model
`plotResiduals`	Plot residuals of generalized linear regression model
`plotSlice`	Plot of slices through fitted generalized linear regression surface

Nonlinear Model Diagnostics

`coefCI`	Confidence intervals of coefficient estimates of nonlinear regression model
`coefTest`	Linear hypothesis test on nonlinear regression model coefficients
`plotDiagnostics`	Plot diagnostics of nonlinear regression model
`plotSlice`	Plot of slices through fitted nonlinear regression surface

Linear Hypothesis Tests

`linhyptest`	Linear hypothesis test

Objects

Feature Selection

`FeatureSelectionNCARegression`	Feature selection for regression using neighborhood component analysis (NCA)

Feature Engineering

`FeatureTransformer`	Generated feature transformations (Since R2021a)

Hyperparameter Optimization

`BayesianOptimization`	Bayesian optimization results

Topics

Regression Learner App Workflow

Train Regression Models in Regression Learner App
Workflow for training, comparing, and improving regression models, including automated, manual, and parallel training.
Choose Regression Model Options
In Regression Learner, automatically train a selection of models, or compare and tune options of linear regression models, regression trees, support vector machines, Gaussian process regression models, kernel approximation models, ensembles of regression trees, and regression neural networks.
Feature Selection and Feature Transformation Using Regression Learner App
Identify useful predictors using plots or feature ranking algorithms, select features to include, and transform features using PCA in Regression Learner.
Visualize and Assess Model Performance in Regression Learner
Compare model metrics and visualize results.

Feature Selection

Introduction to Feature Selection
Learn about feature selection algorithms and explore the functions available for feature selection.
Sequential Feature Selection
This topic introduces sequential feature selection and provides an example that selects features sequentially using a custom criterion and the sequentialfs function.
Neighborhood Component Analysis (NCA) Feature Selection
Neighborhood component analysis (NCA) is a nonparametric method for selecting features with the goal of maximizing the prediction accuracy of regression and classification algorithms.
Robust Feature Selection Using NCA for Regression
Perform feature selection that is robust to outliers using a custom robust loss function in NCA.
Select Predictors for Random Forests
Select split predictors for random forests using the interaction test algorithm.

Feature Engineering

Automated Feature Engineering for Regression
Use genrfeatures to engineer new features before training a regression model. Before making predictions on new data, apply the same feature transformations to the new data set.

Automated Model Selection

Automated Regression Model Selection with Bayesian and ASHA Optimization
Use fitrauto to automatically try a selection of regression model types with different hyperparameter values, given training predictor and response data.

Hyperparameter Optimization

Bayesian Optimization Workflow
Perform Bayesian optimization using a fit function or by calling bayesopt directly.
Variables for a Bayesian Optimization
Create variables for Bayesian optimization.
Bayesian Optimization Objective Functions
Create the objective function for Bayesian optimization.
Constraints in Bayesian Optimization
Set different types of constraints for Bayesian optimization.
Optimize a Boosted Regression Ensemble
Minimize cross-validation loss of a regression ensemble.
Bayesian Optimization Plot Functions
Visually monitor a Bayesian optimization.
Bayesian Optimization Output Functions
Monitor a Bayesian optimization.
Bayesian Optimization Algorithm
Understand the underlying algorithms for Bayesian optimization.
Parallel Bayesian Optimization
How Bayesian optimization works in parallel.

Model Interpretation

Interpret Machine Learning Models
Explain model predictions using the lime and shapley objects and the plotPartialDependence function.
Shapley Values for Machine Learning Model
Compute Shapley values for a machine learning model using the interventional algorithm or the conditional algorithm.
Shapley Output Functions
Stop Shapley computations, create plots, save information to your workspace, or perform calculations while using shapley.

Cross-Validation

Implement Cross-Validation Using Parallel Computing
Speed up cross-validation using parallel computing.
Perform Time Series Direct Forecasting with directforecaster
Perform time series direct forecasting with the directforecaster function.
Manually Perform Time Series Forecasting Using Ensembles of Boosted Regression Trees
Manually perform single-step and multiple-step time series forecasting with ensembles of boosted regression trees.

Linear Model Diagnostics

Interpret Linear Regression Results
Display and interpret linear regression output statistics.
Linear Regression
Fit a linear regression model and examine the result.
Linear Regression with Interaction Effects
Construct and analyze a linear regression model with interaction effects and interpret the results.
Summary of Output and Diagnostic Statistics
Evaluate a fitted model by using model properties and object functions.
F-statistic and t-statistic
In linear regression, the F-statistic is the test statistic for the analysis of variance (ANOVA) approach to test the significance of the model or the components in the model. The t-statistic is useful for making inferences about the regression coefficients.
Coefficient of Determination (R-Squared)
Coefficient of determination (R-squared) indicates the proportionate amount of variation in the response variable y explained by the independent variables X in the linear regression model.
Coefficient Standard Errors and Confidence Intervals
Estimated coefficient variances and covariances capture the precision of regression coefficient estimates.
Residuals
Residuals are useful for detecting outlying y values and checking the linear regression assumptions with respect to the error term in the regression model.
Durbin-Watson Test
The Durbin-Watson test assesses whether or not there is autocorrelation among the residuals of time series data.
Cook’s Distance
Cook's distance is useful for identifying outliers in the X values (observations for predictor variables).
Hat Matrix and Leverage
The hat matrix provides a measure of leverage.
Delete-1 Statistics
Delete-1 change in covariance (CovRatio) identifies the observations that are influential in the regression fit.

Generalized Linear Model Diagnostics

Generalized Linear Models
Generalized linear models use linear methods to describe a potentially nonlinear relationship between predictor terms and a response variable.

Nonlinear Model Diagnostics

Nonlinear Regression
Parametric nonlinear models represent the relationship between a continuous response variable and one or more continuous predictor variables.