## Linear Regression

### Introduction

A data *model* explicitly describes a relationship between
*predictor* and *response* variables.
Linear regression fits a data model that is linear in the model coefficients. The
most common type of linear regression is a *least-squares fit*,
which can fit both lines and polynomials, among other linear models.

Before you model the relationship between pairs of quantities, it is a good idea to perform correlation analysis to establish if a linear relationship exists between these quantities. Be aware that variables can have nonlinear relationships, which correlation analysis cannot detect. For more information, see Linear Correlation.

The MATLAB^{®} Basic Fitting UI helps you to fit your data, so you can calculate
model coefficients and plot the model on top of the data. For an example, see Example: Using Basic Fitting UI. You also can use the
MATLAB
`polyfit`

and `polyval`

functions to fit your data
to a model that is linear in the coefficients. For an example, see Programmatic Fitting.

If you need to fit data with a nonlinear model, transform the variables to make
the relationship linear. Alternatively, try to fit a nonlinear function directly
using either the Statistics and Machine Learning Toolbox™
`nlinfit`

function, the Optimization Toolbox™
`lsqcurvefit`

function, or by applying functions in the
Curve Fitting Toolbox™.

This topic explains how to:

Perform simple linear regression using the

`\`

operator.Use correlation analysis to determine whether two quantities are related to justify fitting the data.

Fit a linear model to the data.

Evaluate the goodness of fit by plotting residuals and looking for patterns.

Calculate measures of goodness of fit R

^{2}and adjusted R^{2}

### Simple Linear Regression

This example shows how to perform simple linear regression using the `accidents`

dataset. The example also shows you how to calculate the coefficient of determination $${R}^{2}$$ to evaluate the regressions. The `accidents`

dataset contains data for fatal traffic accidents in U.S. states.

Linear regression models the relation between a dependent, or response, variable $$y$$ and one or more independent, or predictor, variables $${x}_{1},...,{x}_{n}$$. Simple linear regression considers only one independent variable using the relation

$$y={\beta}_{0}+{\beta}_{1}x+\u03f5,$$

where $${\beta}_{0}$$ is the y-intercept, $${\beta}_{1}$$ is the slope (or regression coefficient), and $$\u03f5$$ is the error term.

Start with a set of $$n$$ observed values of $$x$$ and $$y$$ given by $$({x}_{1},{y}_{1})$$, $$({x}_{2},{y}_{2})$$, ..., $$({x}_{n},{y}_{n})$$. Using the simple linear regression relation, these values form a system of linear equations. Represent these equations in matrix form as

$$\left[\begin{array}{c}{y}_{1}\\ {y}_{2}\\ \vdots \\ {y}_{n}\end{array}\right]=\left[\begin{array}{cc}1& {x}_{1}\\ 1& {x}_{2}\\ \vdots & \vdots \\ 1& {x}_{n}\end{array}\right]\left[\begin{array}{c}{\beta}_{0}\\ {\beta}_{1}\end{array}\right].$$

Let

$$Y=\left[\begin{array}{c}{y}_{1}\\ {y}_{2}\\ \vdots \\ {y}_{n}\end{array}\right],X=\left[\begin{array}{cc}1& {x}_{1}\\ 1& {x}_{2}\\ \vdots & \vdots \\ 1& {x}_{n}\end{array}\right],B=\left[\begin{array}{c}{\beta}_{0}\\ {\beta}_{1}\end{array}\right].$$

The relation is now $$Y=XB$$.

In MATLAB, you can find $$B$$ using the `mldivide`

operator as `B = X\Y`

.

From the dataset `accidents`

, load accident data in `y`

and state population data in `x`

. Find the linear regression relation $$y={\beta}_{1}x$$ between the accidents in a state and the population of a state using the `\`

operator. The `\`

operator performs a least-squares regression.

load accidents x = hwydata(:,14); %Population of states y = hwydata(:,4); %Accidents per state format long b1 = x\y

b1 = 1.372716735564871e-04

`b1`

is the slope or regression coefficient. The linear relation is $$y={\beta}_{1}x=0.0001372x$$.

Calculate the accidents per state `yCalc`

from `x`

using the relation. Visualize the regression by plotting the actual values `y`

and the calculated values `yCalc`

.

yCalc1 = b1*x; scatter(x,y) hold on plot(x,yCalc1) xlabel('Population of state') ylabel('Fatal traffic accidents per state') title('Linear Regression Relation Between Accidents & Population') grid on

Improve the fit by including a y-intercept $${\beta}_{0}$$ in your model as $$y={\beta}_{0}+{\beta}_{1}x$$. Calculate $${\beta}_{0}$$ by padding `x`

with a column of ones and using the `\`

operator.

X = [ones(length(x),1) x]; b = X\y

b =2×110^{2}× 142.7120 0.0001

This result represents the relation $$y={\beta}_{0}+{\beta}_{1}x=142.7120+0.0001256x$$.

Visualize the relation by plotting it on the same figure.

yCalc2 = X*b; plot(x,yCalc2,'--') legend('Data','Slope','Slope & Intercept','Location','best');

From the figure, the two fits look similar. One method to find the better fit is to calculate the coefficient of determination, $${R}^{2}$$. $${R}^{2}$$ is one measure of how well a model can predict the data, and falls between $$0$$ and $$1$$. The higher the value of $${R}^{2}$$, the better the model is at predicting the data.

Where $$\underset{}{\overset{\u02c6}{y}}$$ represents the calculated values of $$y$$ and $$\underset{}{\overset{\u203e}{y}}$$ is the mean of $$y$$, $${R}^{2}$$ is defined as

$${R}^{2}=1-\frac{\sum _{i=1}^{n}{({y}_{i}-{\underset{}{\overset{\u02c6}{y}}}_{i})}^{2}}{\sum _{i=1}^{n}{({y}_{i}-\underset{}{\overset{\u203e}{y}})}^{2}}.$$

Find the better fit of the two fits by comparing values of $${R}^{2}$$. As the $${R}^{2}$$ values show, the second fit that includes a y-intercept is better.

Rsq1 = 1 - sum((y - yCalc1).^2)/sum((y - mean(y)).^2)

Rsq1 = 0.822235650485566

Rsq2 = 1 - sum((y - yCalc2).^2)/sum((y - mean(y)).^2)

Rsq2 = 0.838210531103428

### Residuals and Goodness of Fit

Residuals are the difference between the *observed* values of
the response (dependent) variable and the values that a model
*predicts*. When you fit a model that is appropriate for your
data, the residuals approximate independent random errors. That is, the distribution
of residuals ought not to exhibit a discernible pattern.

Producing a fit using a linear model requires minimizing the sum of the squares of the residuals. This minimization yields what is called a least-squares fit. You can gain insight into the “goodness” of a fit by visually examining a plot of the residuals. If the residual plot has a pattern (that is, residual data points do not appear to have a random scatter), the randomness indicates that the model does not properly fit the data.

Evaluate each fit you make in the context of your data. For example, if your goal of fitting the data is to extract coefficients that have physical meaning, then it is important that your model reflect the physics of the data. Understanding what your data represents, how it was measured, and how it is modeled is important when evaluating the goodness of fit.

One measure of goodness of fit is the *coefficient of
determination*, or R^{2} (pronounced
r-square). This statistic indicates how closely values you obtain from fitting a
model match the dependent variable the model is intended to predict. Statisticians
often define R^{2} using the residual variance from a fitted
model:

R^{2} = 1 – SS_{resid} /
SS_{total}

*SS _{resid}* is the sum of the squared
residuals from the regression.

*SS*is the sum of the squared differences from the mean of the dependent variable (

_{total}*total sum of squares*). Both are positive scalars.

To learn how to compute R^{2} when you use the Basic
Fitting tool, see R2, the Coefficient of Determination. To learn more about calculating the R^{2} statistic and
its multivariate generalization, continue reading here.

#### Example: Computing R^{2} from Polynomial Fits

You can derive R^{2} from the coefficients of a
polynomial regression to determine how much variance in `y`

a
linear model explains, as the following example describes:

Create two variables,

`x`

and`y`

, from the first two columns of the`count`

variable in the data file`count.dat`

:load count.dat x = count(:,1); y = count(:,2);

Use

`polyfit`

to compute a linear regression that predicts`y`

from`x`

:p = polyfit(x,y,1) p = 1.5229 -2.1911

`p(1)`

is the slope and`p(2)`

is the intercept of the linear predictor. You can also obtain regression coefficients using the Basic Fitting UI.Call

`polyval`

to use`p`

to predict`y`

, calling the result`yfit`

:yfit = polyval(p,x);

Using

`polyval`

saves you from typing the fit equation yourself, which in this case looks like:yfit = p(1) * x + p(2);

Compute the residual values as a vector of signed numbers:

yresid = y - yfit;

Square the residuals and total them to obtain the residual sum of squares:

SSresid = sum(yresid.^2);

Compute the total sum of squares of

`y`

by multiplying the variance of`y`

by the number of observations minus`1`

:SStotal = (length(y)-1) * var(y);

Compute R

^{2}using the formula given in the introduction of this topic:This demonstrates that the linear equationrsq = 1 - SSresid/SStotal rsq = 0.8707

`1.5229 * x -2.1911`

predicts 87% of the variance in the variable`y`

.

#### Computing Adjusted R^{2} for Polynomial Regressions

You can usually reduce the residuals in a model by fitting a higher degree
polynomial. When you add more terms, you increase the coefficient of
determination, R^{2}. You get a closer fit to the data,
but at the expense of a more complex model, for which
R^{2} cannot account. However, a refinement of this
statistic, *adjusted R ^{2}*, does
include a penalty for the number of terms in a model. Adjusted
R

^{2}, therefore, is more appropriate for comparing how different models fit to the same data. The adjusted R

^{2}is defined as:

R^{2}_{adjusted} = 1
- (SS_{resid} /
SS_{total})*((*n*-1)/(*n*-*d*-1))

*n*is the number of observations in your data, and

*d*is the degree of the polynomial. (A linear fit has a degree of 1, a quadratic fit 2, a cubic fit 3, and so on.)

The following example repeats the steps of the previous example, Example: Computing R2 from Polynomial Fits, but performs a cubic (degree 3) fit instead of a linear (degree 1) fit. From
the cubic fit, you compute both simple and adjusted
R^{2} values to evaluate whether the extra terms
improve predictive power:

Create two variables,

`x`

and`y`

, from the first two columns of the`count`

variable in the data file`count.dat`

:load count.dat x = count(:,1); y = count(:,2);

Call

`polyfit`

to generate a cubic fit to predict`y`

from`x`

:p = polyfit(x,y,3) p = -0.0003 0.0390 0.2233 6.2779

`p(4)`

is the intercept of the cubic predictor. You can also obtain regression coefficients using the Basic Fitting UI.Call

`polyval`

to use the coefficients in`p`

to predict`y`

, naming the result`yfit`

:yfit = polyval(p,x);

`polyval`

evaluates the explicit equation you could manually enter as:yfit = p(1) * x.^3 + p(2) * x.^2 + p(3) * x + p(4);

Compute the residual values as a vector of signed numbers:

yresid = y - yfit;

Square the residuals and total them to obtain the residual sum of squares:

SSresid = sum(yresid.^2);

Compute the total sum of squares of

`y`

by multiplying the variance of`y`

by the number of observations minus`1`

:SStotal = (length(y)-1) * var(y);

Compute simple R

^{2}for the cubic fit using the formula given in the introduction of this topic:rsq = 1 - SSresid/SStotal rsq = 0.9083

Finally, compute adjusted R

^{2}to account for degrees of freedom:The adjusted Rrsq_adj = 1 - SSresid/SStotal * (length(y)-1)/(length(y)-length(p)) rsq_adj = 0.8945

^{2}, 0.8945, is smaller than simple R^{2}, .9083. It provides a more reliable estimate of the power of your polynomial model to predict.

In many polynomial regression models, adding terms to the equation increases
both R^{2} and adjusted R^{2}.
In the preceding example, using a cubic fit increased both statistics compared
to a linear fit. (You can compute adjusted R^{2} for the
linear fit for yourself to demonstrate that it has a lower value.) However, it
is not always true that a linear fit is worse than a higher-order fit: a more
complicated fit can have a lower adjusted R^{2} than a
simpler fit, indicating that the increased complexity is not justified. Also,
while R^{2} always varies between 0 and 1 for the
polynomial regression models that the Basic Fitting tool generates, adjusted
R^{2} for some models can be negative, indicating
that a model that has too many terms.

Correlation does not imply causality. Always interpret coefficients of correlation and determination cautiously. The coefficients only quantify how much variance in a dependent variable a fitted model removes. Such measures do not describe how appropriate your model—or the independent variables you select—are for explaining the behavior of the variable the model predicts.

### Fitting Data with Curve Fitting Toolbox Functions

The Curve Fitting Toolbox software extends core MATLAB functionality by enabling the following data-fitting capabilities:

Linear and nonlinear parametric fitting, including standard linear least squares, nonlinear least squares, weighted least squares, constrained least squares, and robust fitting procedures

Nonparametric fitting

Statistics for determining the goodness of fit

Extrapolation, differentiation, and integration

Dialog box that facilitates data sectioning and smoothing

Saving fit results in various formats, including MATLAB code files, MAT-files, and workspace variables

For more information, see the Curve Fitting Toolbox documentation.