## Handling Missing Data and Outliers

### Handling Missing Data

Data acquisition failures sometimes result in missing measurements in both the input and
the output signals. When you import data that contains missing values using the MATLAB® Import Wizard, these values are automatically set to `NaN`. `NaN` serves as a flag for nonexistent or undefined data. When you plot
data that contains missing values on a time plot, gaps appear on the plot where data is
missing.
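As a minimal sketch of how `NaN` flags can be located before you plot or estimate, the following base MATLAB snippet counts and indexes the missing samples. The vector `y` and its values are made up for illustration.

```matlab
% Hypothetical output signal with three missing samples flagged as NaN
y = [1.0; 1.2; NaN; 1.1; NaN; NaN; 0.9];

missingMask = isnan(y);           % logical mask, true where data is missing
numMissing  = nnz(missingMask);   % number of missing samples
idxMissing  = find(missingMask);  % positions of the missing samples
```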

You can use `misdata` to estimate missing values. This command
first linearly interpolates the missing values and uses the result to estimate an initial model. Then, it uses this model
to estimate the missing data as parameters by minimizing the output prediction errors
obtained from the reconstructed data. You can specify the model structure you want to use as
an argument to `misdata`, or let the command estimate a default-order model using the
`n4sid` method. For more information, see the `misdata` reference page.
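The initial linear-interpolation step can be illustrated in base MATLAB. This is only a sketch of the initialization idea, using `interp1` on made-up data; it does not reproduce the model-based refinement that `misdata` performs afterward. The names `t`, `y`, and `yInit` are assumptions for illustration.

```matlab
% Hypothetical signal sampled every 0.2 s, with NaN marking missing samples
t = (0:9)' * 0.2;
y = [0 1 2 NaN NaN 5 6 NaN 8 9]';

known = ~isnan(y);   % samples that were actually measured

% Fill each gap by linear interpolation between the neighboring known samples
yInit = y;
yInit(~known) = interp1(t(known), y(known), t(~known), 'linear');
```

Because plain interpolation ignores the system dynamics, it serves here only as a starting point for the model-based estimate.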

**Note**

You can only use `misdata` on time-domain data stored in an
`iddata` object. For more information about creating
`iddata` objects, see Representing Time- and Frequency-Domain Data Using iddata Objects.

For example, suppose `y` and `u` are output and input
signals that contain `NaN`s. This data is sampled with a sample time of `0.2` s. The following syntax creates a new `iddata` object with these input and
output signals.

```matlab
dat = iddata(y,u,0.2) % y and u contain NaNs
                      % representing missing data
```

Apply the `misdata` command to the new data object. For example:

```matlab
dat1 = misdata(dat);
plot(dat,dat1)       % Check how the missing data
                     % was estimated on a time plot
```

### Handling Outliers

Malfunctions can produce errors in measured values, called
*outliers*. Such outliers might be caused by signal spikes or by
measurement malfunctions. If you do not remove outliers from your data, they can adversely
affect the estimated models.

To identify the presence of outliers, perform one of the following tasks:

- Before estimating a model, plot the data on a time plot and identify values that appear out of range.

- After estimating a model, plot the residuals and identify unusually large values. For more information about plotting residuals, see topics on the Residual Analysis page. Evaluate the original data that is responsible for large residuals. For example, for the model `Model` and validation data `Data`, you can use the following commands to plot the residuals:

  ```matlab
  % Compute the residuals
  E = resid(Data,Model)
  % Plot the residuals
  plot(E)
  ```
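Once residuals are available as a plain numeric vector, locating the unusually large values can be sketched as follows. The vector `e` here holds placeholder values standing in for real residuals, and the three-standard-deviation threshold is an illustrative assumption, not a rule from the toolbox.

```matlab
% Hypothetical residual vector: small values plus one large residual
e = 0.1 * sin((1:50)');
e(20) = 2.5;

% Flag residuals whose magnitude exceeds three standard deviations
thr      = 3 * std(e);
idxLarge = find(abs(e) > thr);   % sample indices to inspect in the original data
```

Large outliers inflate the standard deviation, so a median-based spread estimate may be a better choice when several outliers are present.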

Next, try these techniques for removing or minimizing the effects of outliers:

- Extract the informative data portions into segments and merge them into one multiexperiment data set (see Extract and Model Specific Data Segments). For more information about selecting and extracting data segments, see Select Subsets of Data.

  **Tip** The inputs in each of the data segments must be consistently exciting the system. Splitting data into meaningful segments for steady-state data results in minimum information loss. Avoid making data segments too small.

- Manually replace outliers with `NaN`s and then use the `misdata` command to reconstruct flagged data. This approach treats outliers as missing data and is described in Handling Missing Data. Use this method when your data contains several inputs and outputs, and when you have difficulty finding reliable data segments in all variables.

- Remove outliers by prefiltering the data for high-frequency content because outliers often result from abrupt changes. For more information about filtering, see Filtering Data.
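Replacing outliers with `NaN`s can be sketched in base MATLAB as follows. The detection rule used here (more than three scaled median absolute deviations from the median) is an assumption chosen for illustration; in practice you might pick the samples by inspecting a time plot instead. The signal `y` is made up.

```matlab
% Hypothetical output signal: a slow sinusoid with one injected spike
y = sin(2*pi*(0:99)'/50);
y(30) = 15;

% Flag samples more than three scaled median absolute deviations from the median
med       = median(y);
scaledMad = 1.4826 * median(abs(y - med));
isOut     = abs(y - med) > 3 * scaledMad;

% Treat the flagged samples as missing data
yFlagged = y;
yFlagged(isOut) = NaN;
```

The flagged signal can then be stored in an `iddata` object and passed to `misdata`, as described in Handling Missing Data.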

**Note**

The estimation algorithm can handle outliers by assigning a smaller weight to outlier
data. A robust error criterion applies an error penalty that is quadratic for small and
moderate prediction errors, and is linear for large prediction errors. Because outliers
produce large prediction errors, this approach gives a smaller weight to the corresponding
data points during model estimation. Set the `ErrorThreshold` estimation
option (see `Advanced.ErrorThreshold` in, for example, `polyestOptions`) to a nonzero value to activate the correction for outliers
in the estimation algorithm.
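To show why a quadratic-then-linear penalty reduces the influence of outliers, the following sketch compares an ordinary quadratic cost with a Huber-type penalty. The exact criterion the toolbox uses is not stated here, so treat `robustPenalty` and the threshold value as assumptions for illustration only.

```matlab
% Hypothetical Huber-type penalty (an assumption, not necessarily the
% toolbox's exact criterion): quadratic up to thr, linear beyond it
robustPenalty = @(e,thr) (abs(e) <= thr) .* e.^2 + ...
                         (abs(e) >  thr) .* (2*thr*abs(e) - thr^2);

thr = 2;                 % illustrative threshold
e   = [0.5; 1; 4; 10];   % prediction errors; the last two exceed thr

quadCost   = e.^2;                   % ordinary quadratic cost
robustCost = robustPenalty(e, thr);  % equal for small errors, smaller for large ones
```

For the two large errors, the robust cost grows linearly rather than quadratically, so those samples pull on the estimate less. With the toolbox, the corresponding setting would be along the lines of `opt = polyestOptions; opt.Advanced.ErrorThreshold = thr;`, with a value suited to your data.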

### See Also

To learn more about the theory of handling missing data and outliers, see the chapter on
preprocessing data in *System Identification: Theory for the User*,
Second Edition, by Lennart Ljung, Prentice Hall PTR, 1999.