Overfitting

What Is Overfitting?

Overfitting is a machine learning behavior that occurs when the model is so closely aligned to the training data that it fails to make accurate predictions on new data. Overfitting can happen because:

The machine learning model is too complex; it memorizes very subtle patterns in the training data that don’t generalize well.
The training data set is too small for the model’s complexity, contains large amounts of irrelevant information (noise), or both.

You can prevent overfitting by managing model complexity and improving the training data set.

Overfitting vs. Underfitting

Underfitting is the opposite of overfitting: the model doesn’t align well with the training data or generalize well to new data. Overfitting and underfitting can be present in both classification and regression models. The following figure illustrates how the classification decision boundary and regression line follow the training data too closely for an overfitted model and not closely enough for an underfitted model.

Figure: Plots of data that show overfitting, correct fitting, and underfitting for classification and regression models. Overfitted classification and regression models memorize the training data too well in comparison with correctly fitted models.

When looking only at the computed error of a machine learning model for the training data, overfitting is harder to detect than underfitting, because an overfitted model has a low training error. So, to avoid overfitting, it is important to validate a machine learning model on data it was not trained on before using it on test data.

Error	Overfitting	Right Fit	Underfitting
Training	Low	Low	High
Test	High	Low	High

The computed error of overfitted models is low for training data but high for test data.
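
The pattern in the table can be reproduced with a small experiment. The following is a minimal sketch, not a prescribed workflow: the data, the noise level, and the polynomial degrees are all illustrative assumptions, and only base MATLAB functions (polyfit, polyval) are used.

```matlab
% Sketch: compare training and test error for polynomial fits of
% increasing flexibility. Data and degrees are illustrative assumptions.
rng(0);                                  % fix the random seed for repeatability
xTrain = linspace(-1, 1, 12)';           % 12 training inputs
xTest  = linspace(-0.95, 0.95, 50)';     % held-out inputs in the same range
f      = @(x) sin(3*x);                  % assumed underlying relationship
yTrain = f(xTrain) + 0.2*randn(size(xTrain));
yTest  = f(xTest)  + 0.2*randn(size(xTest));

degrees  = [1 3 10];                     % too simple, moderate, very flexible
trainMSE = zeros(size(degrees));
testMSE  = zeros(size(degrees));
for i = 1:numel(degrees)
    p = polyfit(xTrain, yTrain, degrees(i));               % least-squares fit
    trainMSE(i) = mean((polyval(p, xTrain) - yTrain).^2);  % error on training data
    testMSE(i)  = mean((polyval(p, xTest)  - yTest).^2);   % error on held-out data
end

% Show both errors side by side for each degree
disp(table(degrees', trainMSE', testMSE', ...
    'VariableNames', {'Degree', 'TrainMSE', 'TestMSE'}));
```

The training error can only decrease as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The test error typically stops improving and then rises once the fit starts following the noise, which is the signature of overfitting.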

Using MATLAB® with Statistics and Machine Learning Toolbox™ and Deep Learning Toolbox™, you can prevent overfitting of machine learning and deep learning models. MATLAB provides functions and methods specifically designed to avoid overfitting, which you can use when you train or tune your model.

How to Avoid Overfitting by Reducing Model Complexity

With MATLAB, you can train machine learning models and deep learning models (such as CNNs) from scratch or take advantage of pretrained deep learning models. To prevent overfitting, perform model validation to ensure that you choose a model with the right level of complexity for your data or use regularization to reduce the complexity of the model.

Model Validation

The error of an overfitted model is low when computed for the training data, so the training error alone cannot reveal overfitting. It is good practice to validate your model on a separate data set (a validation data set) before introducing new data. For MATLAB machine learning models, you can use the cvpartition function to randomly partition a data set into training and validation sets. For deep learning models, you can monitor the validation accuracy during training. Improving the validation accuracy of your models through model selection and hyperparameter tuning should translate into improved accuracy when the model sees new data.
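
As an illustration of the holdout approach, here is a minimal sketch that uses cvpartition to set aside a validation set. The choice of the fisheriris sample data set, the 30% holdout fraction, and the decision tree model are assumptions made for this example only.

```matlab
% Sketch: hold out 30% of a data set for validation.
% Assumes Statistics and Machine Learning Toolbox (fisheriris, fitctree).
rng(1);                                        % repeatable random partition
load fisheriris                                % loads meas (150x4) and species
c = cvpartition(species, 'HoldOut', 0.3);      % random split, stratified by class

XTrain = meas(training(c), :);                 % rows assigned to training
yTrain = species(training(c));
XVal   = meas(test(c), :);                     % rows held out for validation
yVal   = species(test(c));

mdl = fitctree(XTrain, yTrain);                % train on the training set only

trainErr = loss(mdl, XTrain, yTrain);          % classification loss on training data
valErr   = loss(mdl, XVal, yVal);              % classification loss on validation data
fprintf('Training error: %.3f, validation error: %.3f\n', trainErr, valErr);
```

A validation error that is much higher than the training error suggests that the model may be overfitting.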

Cross-validation is a model assessment technique used to evaluate a machine learning algorithm’s performance in making predictions on data sets it has not been trained on. Cross-validation helps you choose an algorithm that is not overly complex and is therefore less likely to overfit. Use the crossval function to compute the cross-validation error estimate for machine learning models by using common cross-validation techniques, such as k-fold (partitions data into k randomly chosen subsets of roughly equal size) and holdout (partitions data randomly into exactly two subsets of a specified ratio).
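
The following sketch shows one way k-fold cross-validation might be used to compare models of different complexity. It uses the crossval method of a trained classification tree together with kfoldLoss; the data set, the number of folds, and the candidate 'MaxNumSplits' values are assumptions chosen for illustration.

```matlab
% Sketch: use 5-fold cross-validation to compare trees of different sizes.
% Assumes Statistics and Machine Learning Toolbox.
rng(1);                                    % repeatable fold assignment
load fisheriris                            % loads meas and species

maxSplits = [1 3 10 50];                   % candidate complexity levels (assumed)
cvErr = zeros(size(maxSplits));
for i = 1:numel(maxSplits)
    mdl   = fitctree(meas, species, 'MaxNumSplits', maxSplits(i));
    cvmdl = crossval(mdl, 'KFold', 5);     % 5-fold cross-validated model
    cvErr(i) = kfoldLoss(cvmdl);           % average misclassification rate over folds
end

[~, best] = min(cvErr);                    % lowest cross-validation error
fprintf('Lowest CV error %.3f at MaxNumSplits = %d\n', cvErr(best), maxSplits(best));
```

When several settings give similar cross-validation errors, choosing the simplest of them is a common way to reduce the risk of overfitting.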

Regularization

Regularization is a technique used to prevent statistical overfitting in a machine learning model. Regularization algorithms typically work by applying a penalty for either complexity or roughness. By introducing this additional information into the model, regularization algorithms can handle multicollinearity and redundant predictors, making the model more parsimonious and accurate.

For machine learning, you can choose between three popular regularization techniques—lasso (L1 norm), ridge (L2 norm), and elastic net—with several types of linear machine learning models. For deep learning, you can increase the L2 regularization factor in the specified training options or use dropout layers in your network to avoid overfitting.
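
For a linear regression model, the three techniques can be sketched as follows. The synthetic data, the elastic net mixing value ('Alpha' of 0.5), and the ridge parameter are assumptions for illustration; the lasso and ridge functions are part of Statistics and Machine Learning Toolbox.

```matlab
% Sketch: lasso, elastic net, and ridge on synthetic data in which only
% two of ten predictors matter. All data and settings are assumptions.
rng(2);
n = 100;  p = 10;
X = randn(n, p);
betaTrue = [3; -2; zeros(p-2, 1)];         % predictors 3..10 are redundant
y = X*betaTrue + 0.5*randn(n, 1);

% Lasso (L1 penalty) with 10-fold cross-validation over a lambda sequence
[B, FitInfo] = lasso(X, y, 'CV', 10);
coefLasso = B(:, FitInfo.Index1SE);        % sparsest fit within 1 SE of the minimum CV error

% Elastic net: Alpha between 0 and 1 mixes the L1 and L2 penalties
[Ben, FitInfoEn] = lasso(X, y, 'Alpha', 0.5, 'CV', 10);
coefEn = Ben(:, FitInfoEn.Index1SE);

% Ridge (L2 penalty) with ridge parameter 1; the last argument (0) returns
% coefficients on the original data scale, with the intercept first
coefRidge = ridge(y, X, 1, 0);

fprintf('Nonzero coefficients: lasso %d, elastic net %d\n', ...
    nnz(coefLasso), nnz(coefEn));
```

Lasso and elastic net can set coefficients exactly to zero, which removes redundant predictors from the model, whereas ridge shrinks coefficients toward zero without eliminating them.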

How to Avoid Overfitting by Enhancing the Training Data Set

Cross-validation and regularization prevent overfitting by managing model complexity. Another approach is to improve the data set. Deep learning models, especially, require large amounts of data to avoid overfitting.

Data Augmentation

When data availability is limited, data augmentation artificially expands the training data set by adding randomized versions of the existing data. With MATLAB, you can augment image, audio, and other types of data. For example, you can augment image data by randomizing the scale and rotation of existing images.
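
The image example above might look like the following sketch, which assumes Deep Learning Toolbox. The rotation and scale ranges, the number of copies, and the use of the peppers.png sample image are illustrative assumptions.

```matlab
% Sketch: create randomized versions of one image by varying rotation
% and scale. Ranges and the sample image are assumptions.
augmenter = imageDataAugmenter( ...
    'RandRotation', [-20 20], ...          % random rotation angle, in degrees
    'RandScale',    [0.8 1.2]);            % random uniform scale factor

img = imread('peppers.png');               % sample image that ships with MATLAB

numCopies = 4;
augmented = cell(1, numCopies);
for k = 1:numCopies
    augmented{k} = augment(augmenter, img);   % each call draws new random parameters
end
```

In a training workflow, an augmenter like this is typically passed to an augmentedImageDatastore through its 'DataAugmentation' option, so that new random variations are generated as the network trains.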

Data Generation

Synthetic data generation is another method to expand a data set. With MATLAB, you can generate synthetic data by using generative adversarial networks (GANs) or digital twins (data generation through simulation).

Data Cleanup

Noisy data contributes to overfitting. One common way to reduce the number of undesired data points is to remove outliers from the data by using the rmoutliers function.
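
A minimal sketch of outlier removal with rmoutliers follows. The sample vector is an assumption for illustration; with no options, rmoutliers treats a value as an outlier if it is more than three scaled median absolute deviations from the median.

```matlab
% Sketch: remove outliers from a vector using the default detection method.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];   % illustrative data

[B, removed] = rmoutliers(A);   % B: cleaned data, removed: logical mask of dropped entries

disp(A(removed));               % values that were treated as outliers (100 and 300)
```

Whether a flagged value is noise or a legitimate observation depends on the application, so it is worth inspecting the removed points before discarding them.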

Examples and How To

Classify Data Using the Classification Learner App (4:34) - Video
Forecast Electrical Load Using the Regression Learner App (3:42) - Video
Train Network with Augmented Images - Example
Augment Point Cloud Data for Deep Learning - Example
Generate Synthetic Signals Using Conditional GAN - Example
Set Up Parameters and Train Convolutional Neural Network - Example

Software Reference

Regularization - Documentation
Deep Learning Tips and Tricks - Documentation

Overfitting FAQs

What is overfitting?

Overfitting is a machine learning behavior that occurs when the model is so closely aligned to the training data that it fails to make accurate predictions on new data, often because the model is too complex or the training data is too small or noisy.

What is the difference between overfitting and underfitting?

Overfitting occurs when a model follows the training data too closely and doesn’t generalize well to new data, while underfitting occurs when the model doesn’t align well with either the training data or new data.

How can you detect overfitting?

The computed error of overfitted models is low for training data but high for test data. However, when looking only at the computed error of a machine learning model for the training data, overfitting is harder to detect than underfitting. It is good practice to validate your model on a separate data set (a validation data set) before introducing new data.

How does cross-validation help avoid overfitting?

Cross-validation is a model assessment technique that evaluates a machine learning algorithm’s performance on data it hasn’t been trained on, helping you choose an algorithm that isn’t overly complex and is less likely to overfit.

How does regularization prevent overfitting?

Regularization prevents overfitting by applying a penalty for complexity or roughness. This penalty helps the model handle multicollinearity and redundant predictors, making it more parsimonious and accurate.

Which regularization techniques does MATLAB offer?

MATLAB offers lasso (L1 norm), ridge (L2 norm), and elastic net for machine learning models, and L2 regularization factors or dropout layers for deep learning models.

What is data augmentation?

Data augmentation artificially expands the training data set by adding randomized versions of existing data, which is especially useful when data availability is limited.

Does data cleanup help prevent overfitting?

Yes, removing outliers and reducing data noisiness helps prevent overfitting, because noisy data leads the model to memorize irrelevant patterns.
