regress

Multiple linear regression

Syntax

b = regress(y,X)

[b,bint] = regress(y,X)

[b,bint,r] = regress(y,X)

[b,bint,r,rint] = regress(y,X)

[b,bint,r,rint,stats] = regress(y,X)

[___] = regress(y,X,alpha)

Description

b = regress(y,X) returns a vector b of coefficient estimates for a multiple linear regression of the responses in vector y on the predictors in matrix X. To compute coefficient estimates for a model with a constant term (intercept), include a column of ones in the matrix X.
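The role of the column of ones can be sketched with a small, noise-free data set. The values below are made up for illustration (they are assumptions, not a shipped data set) and are generated from y = 2 + 3x, so the fit with an intercept should recover those two numbers.

```matlab
% Illustrative noise-free data generated from y = 2 + 3*x (assumed values).
x = (1:5)';
y = 2 + 3*x;

% Include a column of ones so that the model has a constant term.
X = [ones(size(x)) x];
b = regress(y,X);      % b(1) is the intercept, b(2) is the slope

% Without the column of ones, regress fits a line through the origin.
b0 = regress(y,x);     % a single slope coefficient

disp(b')
disp(b0)
```

Here b0 differs from the true slope because the model without a constant term cannot represent the offset of 2.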

[b,bint] = regress(y,X) also returns a matrix bint of 95% confidence intervals for the coefficient estimates.

[b,bint,r] = regress(y,X) also returns an additional vector r of residuals.

[b,bint,r,rint] = regress(y,X) also returns a matrix rint of intervals that can be used to diagnose outliers.

[b,bint,r,rint,stats] = regress(y,X) also returns a vector stats that contains the R² statistic, the F-statistic and its p-value, and an estimate of the error variance. The matrix X must include a column of ones for the software to compute the model statistics correctly.

[___] = regress(y,X,alpha) uses a 100*(1-alpha)% confidence level to compute bint and rint. Specify any of the output argument combinations in the previous syntaxes.
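As a brief illustration of the alpha argument, the following sketch fits the hald data (also used in the examples on this page) at the default level and at alpha = 0.01, then compares the widths of the coefficient intervals. The variable names width95 and width99 are chosen here for illustration only.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];   % includes column of ones

[~,bint95] = regress(y,X);         % default alpha = 0.05, 95% intervals
[~,bint99] = regress(y,X,0.01);    % alpha = 0.01, 99% intervals

% Interval width is the upper bound minus the lower bound.
width95 = diff(bint95,1,2);
width99 = diff(bint99,1,2);
disp([width95 width99])
```

A smaller alpha corresponds to a higher confidence level, so the 99% intervals should be wider than the 95% intervals for every coefficient.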

Examples

Estimate Multiple Linear Regression Coefficients

Load the carsmall data set. Identify weight and horsepower as predictors and mileage as the response.

load carsmall
x1 = Weight;
x2 = Horsepower;    % Contains NaN data
y = MPG;

Compute the regression coefficients for a linear model with an interaction term.

X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X)    % Removes NaN data

Plot the data and the model.

scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
hold off

Diagnose Outliers Using Residuals

Load the examgrades data set.

load examgrades

Use the last exam scores as response data and the first two exam scores as predictor data.

y = grades(:,5);
X = [ones(size(grades(:,1))) grades(:,1:2)];

Perform multiple linear regression with alpha = 0.01.

[~,~,r,rint] = regress(y,X,0.01);

Diagnose outliers by finding the residual intervals rint that do not contain 0.

contain0 = (rint(:,1)<0 & rint(:,2)>0);
idx = find(contain0==false)

idx = 2×1

    53
    54

Observations 53 and 54 are possible outliers.

Create a scatter plot of the residuals. Fill in the points corresponding to the outliers.

hold on
scatter(y,r)
scatter(y(idx),r(idx),'b','filled')
xlabel("Last Exam Grades")
ylabel("Residuals")
hold off

Determine Significance of Linear Regression Relationship

Load the hald data set. Use heat as the response variable and ingredients as the predictor data.

load hald
y = heat;
X1 = ingredients;
x1 = ones(size(X1,1),1);
X = [x1 X1];    % Includes column of ones

Perform multiple linear regression and generate model statistics.

[~,~,~,~,stats] = regress(y,X)

stats = 1×4

    0.9824  111.4792    0.0000    5.9830

Because the R² value of 0.9824 is close to 1, and the p-value (displayed as 0.0000 after rounding) is less than the default significance level of 0.05, a significant linear regression relationship exists between the response y and the predictor variables in X.

Input Arguments

`y` — Response data
numeric vector

Response data, specified as an n-by-1 numeric vector. Rows of y correspond to different observations. y must have the same number of rows as X.

Data Types: single | double

`X` — Predictor data
numeric matrix

Predictor data, specified as an n-by-p numeric matrix. Rows of X correspond to observations, and columns correspond to predictor variables. X must have the same number of rows as y.

Data Types: single | double

`alpha` — Significance level
`0.05` (default) | positive scalar

Significance level, specified as a positive scalar. alpha must be between 0 and 1.

Data Types: single | double

Output Arguments

`b` — Coefficient estimates for multiple linear regression
numeric vector

Coefficient estimates for multiple linear regression, returned as a numeric vector. b is a p-by-1 vector, where p is the number of predictors in X. If the columns of X are linearly dependent, regress sets the maximum number of elements of b to zero.

Data Types: double

`bint` — Lower and upper confidence bounds for coefficient estimates
numeric matrix

Lower and upper confidence bounds for coefficient estimates, returned as a numeric matrix. bint is a p-by-2 matrix, where p is the number of predictors in X. The first column of bint contains lower confidence bounds for each of the coefficient estimates; the second column contains upper confidence bounds. If the columns of X are linearly dependent, regress returns zeros in elements of bint corresponding to the zero elements of b.

Data Types: double
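The behavior for linearly dependent columns described for b and bint can be seen in a small sketch. The data below are made up for illustration (an assumption, not a shipped data set); the third column of X is exactly twice the second, so X does not have full column rank.

```matlab
% Illustrative data (assumed values) with a fixed, hand-written noise vector.
x = (1:10)';
noise = [0.1; -0.2; 0.05; 0.15; -0.1; 0.2; -0.05; -0.15; 0.1; -0.1];
y = 1 + 3*x + noise;

% The third column is an exact multiple of the second column.
X = [ones(10,1) x 2*x];

[b,bint] = regress(y,X);   % regress may warn that X is rank deficient

isZero = (b == 0);         % coefficients that regress set to zero
disp(b')
disp(bint(isZero,:))       % matching rows of bint are also zero
```

Which of the dependent columns receives the zero coefficient is left to regress, so the sketch locates it with b == 0 rather than assuming a position.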

`r` — Residuals
numeric vector

Residuals, returned as a numeric vector. r is an n-by-1 vector, where n is the number of observations, or rows, in X.

Data Types: single | double

`rint` — Intervals to diagnose outliers
numeric matrix

Intervals to diagnose outliers, returned as a numeric matrix. rint is an n-by-2 matrix, where n is the number of observations, or rows, in X. If the interval rint(i,:) for observation i does not contain zero, the corresponding residual is larger than expected in 100*(1-alpha)% of new observations, suggesting an outlier. For more information, see Algorithms.

Data Types: single | double

`stats` — Model statistics
numeric vector

Model statistics, returned as a numeric vector including the R² statistic, the F-statistic and its p-value, and an estimate of the error variance.

X must include a column of ones so that the model contains a constant term. The F-statistic and its p-value are computed under this assumption and are not correct for models without a constant.
The F-statistic is the test statistic of the F-test on the regression model. The F-test looks for a significant linear regression relationship between the response variable and the predictor variables.
The R² statistic can be negative for models without a constant, indicating that the model is not appropriate for the data.

Data Types: single | double
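To make the four elements of stats concrete, the following sketch recomputes them from the residuals using the standard textbook formulas for a model with a constant term. It is an illustration under that assumption, not a description of the internal implementation of regress; the names SSE, SST, R2, F, pval, and s2 are chosen here.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];   % includes column of ones
[~,~,r,~,stats] = regress(y,X);

[n,p] = size(X);                  % observations and coefficients
SSE = sum(r.^2);                  % residual sum of squares
SST = sum((y - mean(y)).^2);      % total sum of squares about the mean

R2   = 1 - SSE/SST;                      % R-squared statistic
s2   = SSE/(n - p);                      % estimate of the error variance
F    = ((SST - SSE)/(p - 1))/s2;         % F-statistic against a constant-only model
pval = fcdf(F,p - 1,n - p,'upper');      % upper-tail p-value of the F-test

disp([R2 F pval s2])
disp(stats)
```

The two displayed rows should agree up to rounding, which also shows the order of the elements: stats(1) is R², stats(2) is the F-statistic, stats(3) is its p-value, and stats(4) is the error variance estimate.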

Tips

regress treats NaN values in X or y as missing values. regress omits observations with missing values from the regression fit.
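The missing-value handling can be checked with the carsmall data, which contains NaN values. This sketch compares a fit on the full data with a fit on rows that were filtered by hand; the variable names complete, bAll, and bComplete are illustrative.

```matlab
load carsmall
y = MPG;
X = [ones(size(Weight)) Weight Horsepower];   % includes column of ones

bAll = regress(y,X);    % observations with NaN in X or y are omitted

% Keep only the observations with no missing values in y or X.
complete = ~(isnan(y) | any(isnan(X),2));
bComplete = regress(y(complete),X(complete,:));

fprintf('%d of %d observations are complete\n',nnz(complete),numel(y));
disp(max(abs(bAll - bComplete)))   % difference between the two fits
```

Both calls should return the same coefficients, because the fit uses the same set of complete observations in each case.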

Algorithms

Residual Intervals

In a linear model, observed values of y and their residuals are random variables. Residuals have normal distributions with zero mean but with different variances at different values of the predictors. To put residuals on a comparable scale, regress “Studentizes” the residuals. That is, regress divides the residuals by an estimate of their standard deviation that is independent of their value. Studentized residuals have t-distributions with known degrees of freedom. The intervals returned in rint are shifts of the 100*(1-alpha)% confidence intervals of these t-distributions, centered at the residuals.
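The idea can be sketched as follows. This is one plausible reconstruction under stated assumptions, not the confirmed internals of regress: it assumes the standard deviation estimate for residual i comes from the error variance of a fit that leaves out observation i (which makes it independent of that residual), and that the t-distribution has n - p - 1 degrees of freedom. The names h, s2i, se, and rintSketch are hypothetical.

```matlab
load examgrades
y = grades(:,5);
X = [ones(size(y)) grades(:,1:2)];
alpha = 0.01;
[~,~,r,rint] = regress(y,X,alpha);

[n,p] = size(X);
nu = n - p;                      % residual degrees of freedom
[Q,~] = qr(X,0);                 % economy-size QR factorization of X
h = sum(Q.^2,2);                 % leverages (diagonal of the hat matrix)
s2 = sum(r.^2)/nu;               % error variance estimate from all observations

% Assumed: leave-one-out variance estimate, which excludes observation i.
s2i = (nu*s2 - r.^2./(1 - h))/(nu - 1);
se = sqrt(s2i.*(1 - h));         % estimated standard deviation of each residual

t = tinv(1 - alpha/2,nu - 1);    % two-sided t critical value
rintSketch = [r - t*se, r + t*se];   % interval centered at each residual

% Largest difference between the sketch and the intervals from regress.
disp(max(abs(rintSketch(:) - rint(:))))
```

An interval that excludes zero corresponds to a studentized residual whose magnitude exceeds the critical value t, which is why such observations are flagged as possible outliers.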

Alternative Functionality

regress is useful when you simply need the output arguments of the function and when you want to repeat fitting a model multiple times in a loop. If you need to investigate a fitted regression model further, create a linear regression model object LinearModel by using fitlm or stepwiselm. A LinearModel object provides more features than regress.

Use the properties of LinearModel to investigate a fitted linear regression model. The object properties include information about coefficient estimates, summary statistics, fitting method, and input data.
Use the object functions of LinearModel to predict responses and to modify, evaluate, and visualize the linear regression model.
Unlike regress, the fitlm function does not require a column of ones in the input data. A model created by fitlm always includes an intercept term unless you specify not to include it by using the 'Intercept' name-value pair argument.

You can find the information in the output of regress using the properties and object functions of LinearModel.

| Output of `regress` | Equivalent Values in `LinearModel` |
| --- | --- |
| `b` | See the `Estimate` column of the `Coefficients` property. |
| `bint` | Use the `coefCI` function. |
| `r` | See the `Raw` column of the `Residuals` property. |
| `rint` | Not supported. Instead, use studentized residuals (`Residuals` property) and observation diagnostics (`Diagnostics` property) to find outliers. |
| `stats` | See the model display in the Command Window. You can find the statistics in the model properties (`MSE` and `Rsquared`) and by using the `anova` function. |
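The correspondence between the two interfaces can be sketched with the hald data. The variable names ending in LM are chosen here for illustration; the property and function names are the ones listed above.

```matlab
load hald
y = heat;

% regress needs an explicit column of ones for the intercept.
X = [ones(size(ingredients,1),1) ingredients];
[b,bint,r,~,stats] = regress(y,X);

% fitlm adds the intercept term by default.
mdl = fitlm(ingredients,y);

bLM    = mdl.Coefficients.Estimate;   % corresponds to b
bintLM = coefCI(mdl);                 % corresponds to bint (95% by default)
rLM    = mdl.Residuals.Raw;           % corresponds to r
r2LM   = mdl.Rsquared.Ordinary;       % corresponds to stats(1)
mseLM  = mdl.MSE;                     % corresponds to stats(4)

disp([b bLM])
disp([stats(1) r2LM; stats(4) mseLM])
```

The remaining elements of stats, the F-statistic and its p-value, appear in the model display and in the output of the anova function rather than as direct properties.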

References

[1] Chatterjee, S., and A. S. Hadi. “Influential Observations, High Leverage Points, and Outliers in Linear Regression.” Statistical Science. Vol. 1, 1986, pp. 379–416.

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).

Version History

Introduced before R2006a
