isoutlier
Find outliers in data
Syntax
Description
TF = isoutlier(A) returns a logical array whose elements are true when an outlier is detected in the corresponding element of A.
If A is a matrix, then isoutlier operates on each column of A separately.
If A is a multidimensional array, then isoutlier operates along the first dimension of A whose size does not equal 1.
If A is a table or timetable, then isoutlier operates on each variable of A separately.
By default, an outlier is a value that is more than three scaled median absolute deviations (MAD) from the median.
TF = isoutlier(___,Name,Value) specifies additional parameters for detecting outliers using one or more name-value arguments. For example, isoutlier(A,"SamplePoints",t) detects outliers in array A relative to the corresponding elements of a time vector t.
Examples
Detect Outliers in Vector
Find the outliers in a vector of data. A logical 1 in the output indicates the location of an outlier.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A)
TF = 1x15 logical array
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
Use Mean Detection Method
Define outliers as points more than three standard deviations from the mean, and find the locations of outliers in a vector.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
TF = isoutlier(A,"mean")
TF = 1x15 logical array
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
Use Moving Detection Method
Use a moving detection method to detect local outliers in a sine wave that corresponds to a time vector.
Create a vector of data containing a local outlier.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;
Create a time vector that corresponds to the data in A.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
Define outliers as points more than three local scaled MAD from the local median within a sliding window. Find the locations of the outliers in A relative to the points in t with a window size of 5 hours. Plot the data and detected outliers.
TF = isoutlier(A,"movmedian",hours(5),"SamplePoints",t);
plot(t,A)
hold on
plot(t(TF),A(TF),"x")
legend("Original Data","Outlier Data")
Detect Outliers in Matrix
Find outliers for each row of a matrix.
Create a matrix of data containing outliers along the diagonal.
A = magic(5) + diag(200*ones(1,5))
A = 5×5
217 24 1 8 15
23 205 7 14 16
4 6 213 20 22
10 12 19 221 3
11 18 25 2 209
Find the locations of outliers based on the data in each row.
TF = isoutlier(A,2)
TF = 5x5 logical array
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
Visualize Outlier Thresholds
Locate an outlier in a vector of data and visualize the outlier.
Create a vector of data containing a local outlier.
x = 1:10;
A = [60 59 49 49 58 100 61 57 48 58];
Locate the outlier using the default detection method "median".
[TF,L,U,C] = isoutlier(A);
Plot the original data, the outlier, and the thresholds and center value determined by the detection method. The center value is the median of the data, and the upper and lower thresholds are three scaled MAD above and below the median.
plot(x,A)
hold on
plot(x(TF),A(TF),"x")
yline([L U C],":",["Lower Threshold","Upper Threshold","Center Value"])
legend("Original Data","Outlier Data")
Input Arguments
A
— Input data
vector | matrix | multidimensional array | table | timetable
Input data, specified as a vector, matrix, multidimensional array, table, or timetable.
If A is a table, then its variables must be of type double or single, or you can use the DataVariables argument to list double or single variables explicitly. Specifying variables is useful when you are working with a table that contains variables with data types other than double or single.
If A is a timetable, then isoutlier operates only on the table elements. If row times are used as sample points, then they must be unique and listed in ascending order.
Data Types: double | single | table | timetable
method
— Method for detecting outliers
"median" (default) | "mean" | "quartiles" | "grubbs" | "gesd"
Method for detecting outliers, specified as one of these values.
Method | Description
"median" | Outliers are defined as elements more than three scaled MAD from the median. The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
"mean" | Outliers are defined as elements more than three standard deviations from the mean. This method is faster but less robust than "median".
"quartiles" | Outliers are defined as elements more than 1.5 interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). This method is useful when the data in A is not normally distributed.
"grubbs" | Outliers are detected using Grubbs’ test for outliers, which removes one outlier per iteration based on hypothesis testing. This method assumes that the data in A is normally distributed.
"gesd" | Outliers are detected using the generalized extreme Studentized deviate test for outliers. This iterative method is similar to "grubbs", but can perform better when there are multiple outliers masking each other.
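As a rough illustration of how the methods in the table differ, the following sketch runs each one on the vector from the earlier examples and counts the flagged elements. The names methodNames and counts are introduced here for illustration only. The counts for "median" and "mean" follow from the examples above; the other methods are shown without an expected result.

```matlab
% Same data as in the vector examples above.
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Illustrative list of the documented method names.
methodNames = ["median" "mean" "quartiles" "grubbs" "gesd"];
counts = zeros(size(methodNames));

for k = 1:numel(methodNames)
    TF = isoutlier(A,methodNames(k));   % logical mask, same size as A
    counts(k) = nnz(TF);                % number of elements flagged
    fprintf("%-10s flags %d element(s)\n",methodNames(k),counts(k));
end
```

Running several methods side by side like this can help when you are unsure whether the data is close enough to normally distributed for "mean", "grubbs", or "gesd".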
threshold
— Percentile thresholds
two-element row vector
Percentile thresholds, specified as a two-element row vector whose elements are in the interval [0, 100]. The first element indicates the lower percentile threshold, and the second element indicates the upper percentile threshold. The first element of threshold must be less than the second element.
For example, a threshold of [10 90] defines outliers as points below the 10th percentile or above the 90th percentile.
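A minimal sketch of the percentile form, assuming the thresholds are passed together with the "percentiles" method name that appears elsewhere on this page. The data is the vector from the earlier examples, and the additional outputs L and U are used only to inspect the thresholds that were applied.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Flag points outside the 10th and 90th percentiles of A.
[TF,L,U] = isoutlier(A,"percentiles",[10 90]);

% Show the thresholds and the flagged values.
fprintf("Lower threshold: %g, upper threshold: %g\n",L,U);
disp(A(TF))
```

Because the thresholds are percentiles of the data itself, this form flags a roughly fixed share of the points regardless of how extreme they are.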
movmethod
— Moving method
"movmedian" | "movmean"
Moving method for detecting outliers, specified as one of these values.
Method | Description
"movmedian" | Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter.
"movmean" | Outliers are defined as elements more than three local standard deviations from the local mean over a window length specified by window.
window
— Window length
positive integer scalar | two-element vector of positive integers | positive duration scalar | two-element vector of positive durations
Window length, specified as a positive integer scalar, a two-element vector of positive integers, a positive duration scalar, or a two-element vector of positive durations.
When window is a positive integer scalar, the window is centered about the current element and contains window-1 neighboring elements. If window is even, then the window is centered about the current and previous elements.
When window is a two-element vector of positive integers [b f], the window contains the current element, b elements backward, and f elements forward.
When A is a timetable or SamplePoints is specified as a datetime or duration vector, window must be of type duration, and the windows are computed relative to the sample points.
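The three window forms can be sketched on the sine-wave data from the moving-method example. The variable names TFcentered, TFasym, and TFtime are illustrative, and the window sizes are arbitrary choices rather than recommendations.

```matlab
% Sine wave with one injected local outlier, as in the earlier example.
x = -2*pi:0.1:2*pi;
A = sin(x);
A(47) = 0;

% Integer scalar: current element plus 4 neighbors, centered.
TFcentered = isoutlier(A,"movmedian",5);

% Two-element vector [b f]: current element, 3 backward, 1 forward.
TFasym = isoutlier(A,"movmedian",[3 1]);

% Duration window: required when the sample points are datetime values.
t = datetime(2017,1,1,0,0,0) + hours(0:length(x)-1);
TFtime = isoutlier(A,"movmedian",hours(5),"SamplePoints",t);

% Indices flagged by each window form.
disp(find(TFcentered))
disp(find(TFasym))
disp(find(TFtime))
```

An asymmetric window such as [3 1] relies mostly on past samples, which can suit data that arrives in time order.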
dim
— Operating dimension
positive integer scalar
Operating dimension, specified as a positive integer scalar. If no value is specified, then the default is the first array dimension whose size does not equal 1.
Consider an m-by-n input matrix, A:
isoutlier(A,1) detects outliers based on the data in each column of A and returns an m-by-n matrix.
isoutlier(A,2) detects outliers based on the data in each row of A and returns an m-by-n matrix.
For table or timetable input data, dim is not supported and operation is along each table or timetable variable separately.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: isoutlier(A,"mean",ThresholdFactor=4)
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: isoutlier(A,"mean","ThresholdFactor",4)
SamplePoints
— Sample points
vector | table variable name | scalar | function handle | table vartype subscript
Sample points, specified as a vector of sample point values or one of the options in the following table when the input data is a table. The sample points represent the x-axis locations of the data, and must be sorted and contain unique elements. Sample points do not need to be uniformly sampled. The vector [1 2 3 ...] is the default.
When the input data is a table, you can specify the sample points as a table variable using one of these options.
Indexing Scheme | Examples
Variable name:
Variable index:
Function handle:
Variable type:
Note
This name-value argument is not supported when the input data is a timetable. Timetables use the vector of row times as the sample points. To use different sample points, you must edit the timetable so that the row times contain the desired sample points.
Moving windows are defined relative to the sample points. For example, if t is a vector of times corresponding to the input data, then isoutlier(rand(1,10),"movmean",3,"SamplePoints",t) has a window that represents the time interval between t(i)-1.5 and t(i)+1.5.
When the sample points vector has data type datetime or duration, the moving window length must have type duration.
Example: isoutlier(A,"SamplePoints",0:0.1:10)
Example: isoutlier(T,"SamplePoints","Var1")
Data Types: single | double | datetime | duration
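To see how sample points change what a moving window covers, the sketch below applies the same window length to nonuniformly spaced data with and without SamplePoints. The data values and the names t, TFindex, and TFpoints are made up for illustration, and no particular result is implied.

```matlab
% Nonuniform sample points: note the gap between 4 and 10.
t = [0 1 2 3 4 10 11 12 13 14];
A = [1.0 1.1 0.9 5.0 1.0 2.0 2.1 1.9 2.0 2.2];

% Without SamplePoints, the window spans 5 consecutive elements.
TFindex = isoutlier(A,"movmedian",5);

% With SamplePoints, the window spans 5 units of t around each t(i),
% so elements on opposite sides of the gap do not share a window.
TFpoints = isoutlier(A,"movmedian",5,"SamplePoints",t);

disp([TFindex; TFpoints])
```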
DataVariables
— Table variables to operate on
table variable name | scalar | vector | cell array | pattern | function handle | table vartype subscript
Table variables to operate on, specified as one of the options in this table. The DataVariables value indicates which variables of the input table to examine for outliers. The data type associated with the indicated variables must be double or single.
The first output TF contains false for variables not specified by DataVariables unless the value of OutputFormat is "tabular".
Indexing Scheme | Examples
Variable names:
Variable index:
Function handle:
Variable type:
Example: isoutlier(T,"DataVariables",["Var1" "Var2" "Var4"])
OutputFormat
— Output data type
"logical" (default) | "tabular"
Output data type, specified as one of these values:
"logical": For table or timetable input data, return the output TF as a logical array.
"tabular": For table input data, return the output TF as a table. For timetable input data, return the output TF as a timetable.
For vector, matrix, or multidimensional array input data, OutputFormat is not supported.
Example: isoutlier(T,"OutputFormat","tabular")
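The interaction between DataVariables and OutputFormat can be sketched with a small table. The table T and its variables Temp and Site are hypothetical, and the "tabular" option requires R2022a or later according to the version history on this page.

```matlab
% Hypothetical table with one numeric variable and one string variable.
Temp = [20.1; 20.3; 19.8; 45.0; 20.0; 20.2; 19.9];
Site = ["a"; "b"; "c"; "d"; "e"; "f"; "g"];
T = table(Temp,Site);

% Logical output: same size as T, with false in the Site column
% because Site is not listed in DataVariables.
TF = isoutlier(T,"DataVariables","Temp");

% Tabular output: a table containing only the Temp variable.
TFtab = isoutlier(T,"DataVariables","Temp","OutputFormat","tabular");

disp(TF)
disp(TFtab)
```

The tabular form keeps the variable names, which can make it easier to match flags back to columns in a wide table.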
ThresholdFactor
— Detection threshold factor
nonnegative scalar
Detection threshold factor, specified as a nonnegative scalar.
For methods "median" and "movmedian", the detection threshold factor replaces the number of scaled MAD, which is 3 by default.
For methods "mean" and "movmean", the detection threshold factor replaces the number of standard deviations from the mean, which is 3 by default.
For methods "grubbs" and "gesd", the detection threshold factor is a scalar ranging from 0 to 1. Values close to 0 result in a smaller number of outliers, and values close to 1 result in a larger number of outliers. The default detection threshold factor is 0.05.
For the "quartiles" method, the detection threshold factor replaces the number of interquartile ranges, which is 1.5 by default.
This name-value argument is not supported when the specified method is "percentiles".
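As a small sketch of the effect on the "median" method, the following compares the default factor of 3 with a much larger factor on the vector from the earlier examples. The factor of 20 is an arbitrary value chosen only to widen the thresholds enough that the value 100 is no longer flagged.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Default: more than 3 scaled MAD from the median.
TFdefault = isoutlier(A);

% Wider thresholds: more than 20 scaled MAD from the median.
TFloose = isoutlier(A,"median","ThresholdFactor",20);

disp(A(TFdefault))   % values flagged with the default factor
disp(A(TFloose))     % values flagged with the larger factor
```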
MaxNumOutliers
— Maximum outlier count
positive integer scalar
Maximum outlier count, for the "gesd" method only, specified as a positive integer scalar. The MaxNumOutliers value specifies the maximum number of outliers returned by the "gesd" method. For example, isoutlier(A,"gesd","MaxNumOutliers",5) returns no more than five outliers.
The default value for MaxNumOutliers is the integer nearest to 10 percent of the number of elements in A. Setting a larger value for the maximum number of outliers makes it more likely that all outliers are detected but at the cost of reduced computational efficiency.
The "gesd" method assumes the nonoutlier input data is sampled from an approximate normal distribution. When the data is not sampled in this way, the number of returned outliers might exceed the MaxNumOutliers value.
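A minimal sketch of capping the "gesd" search, assuming normally distributed background data with a few injected extreme values. The seed, the data size, and the injected values are arbitrary choices made for this illustration.

```matlab
% Normally distributed data with three injected extreme values.
rng(0,"twister")              % fix the seed so the sketch is repeatable
A = randn(1,100);
A([10 40 70]) = [8 -9 10];

% Allow the "gesd" method to report at most five outliers.
TF = isoutlier(A,"gesd","MaxNumOutliers",5);

disp(find(TF))                % indices of the flagged elements
```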
Output Arguments
TF
— Outlier indicator
vector | matrix | multidimensional array | table | timetable
Outlier indicator, returned as a vector, matrix, multidimensional array, table, or timetable.
TF is the same size as A unless the value of OutputFormat is "tabular". If the value of OutputFormat is "tabular", then TF has only variables corresponding to the DataVariables specified.
Data Types: logical
L
— Lower threshold
scalar | vector | matrix | multidimensional array | table | timetable
Lower threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the lower threshold value of the default outlier detection method is three scaled MAD below the median of the input data.
If method is used for outlier detection, then L has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then L has the same size as A.
Data Types: double | single | table | timetable
U
— Upper threshold
scalar | vector | matrix | multidimensional array | table | timetable
Upper threshold used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the upper threshold value of the default outlier detection method is three scaled MAD above the median of the input data.
If method is used for outlier detection, then U has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then U has the same size as A.
Data Types: double | single | table | timetable
C
— Center value
scalar | vector | matrix | multidimensional array | table | timetable
Center value used by the outlier detection method, returned as a scalar, vector, matrix, multidimensional array, table, or timetable. For example, the center value of the default outlier detection method is the median of the input data.
If method is used for outlier detection, then C has the same size as A in all dimensions except for the operating dimension where the length is 1. If movmethod is used, then C has the same size as A.
Data Types: double | single | table | timetable
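The size rules for L, U, and C can be checked with a short sketch that reuses the matrix from the matrix example. The names L1, L2, and Lm are illustrative; U and C follow the same size rules as L.

```matlab
A = magic(5) + diag(200*ones(1,5));

% Non-moving method along dimension 1: one threshold per column.
[~,L1] = isoutlier(A);
disp(size(L1))    % 1 5

% Non-moving method along dimension 2: one threshold per row.
[~,L2] = isoutlier(A,2);
disp(size(L2))    % 5 1

% Moving method: one threshold per element, so the size matches A.
[~,Lm] = isoutlier(A,"movmedian",3);
disp(size(Lm))    % 5 5
```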
More About
Median Absolute Deviation
For a finite-length vector A made up of N scalar observations, the median absolute deviation (MAD) is defined as
$$\text{MAD} = \text{median}\left(\left|A_i - \text{median}(A)\right|\right)$$
for i = 1,2,...,N.
The scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
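The definition can be checked by hand. The sketch below computes the scaled MAD and the default thresholds for the vector from the earlier examples and compares them with the outputs of isoutlier. The names scaledMAD, centerVal, lowerThr, upperThr, and TFmanual are introduced for illustration, and the manual mask assumes that a value is an outlier when it lies strictly outside the thresholds.

```matlab
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaling constant from the definition above (approximately 1.4826).
c = -1/(sqrt(2)*erfcinv(3/2));
scaledMAD = c*median(abs(A - median(A)));

% Default thresholds: three scaled MAD on either side of the median.
centerVal = median(A);
lowerThr  = centerVal - 3*scaledMAD;
upperThr  = centerVal + 3*scaledMAD;
TFmanual  = (A < lowerThr) | (A > upperThr);

% Compare with the built-in result and its reported thresholds.
[TF,L,U,C] = isoutlier(A);
disp(isequal(TF,TFmanual))
disp([L U C; lowerThr upperThr centerVal])
```

For normally distributed data, the constant c makes the scaled MAD a consistent estimate of the standard deviation, which is why the default factor of 3 is comparable to the three standard deviations used by the "mean" method.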
References
[1] NIST/SEMATECH e-Handbook of Statistical Methods, https://www.itl.nist.gov/div898/handbook/, 2013.
Extended Capabilities
Tall Arrays
Calculate with arrays that have more rows than fit in memory.
Usage notes and limitations:
The "percentiles", "grubbs", and "gesd" methods are not supported.
The "movmedian" and "movmean" methods do not support tall timetables.
The SamplePoints and MaxNumOutliers name-value arguments are not supported.
The value of DataVariables cannot be a function handle.
Computation of isoutlier(A), isoutlier(A,"median",...), or isoutlier(A,"quartiles",...) along the first dimension is supported only for tall column vectors A.
For more information, see Tall Arrays.
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
The "movmean" and "movmedian" methods for detecting outliers do not support timetable input data, datetime SamplePoints values, or duration SamplePoints values.
String and character array inputs must be constant.
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
Usage notes and limitations:
The "movmedian" moving method is not supported.
The SamplePoints and DataVariables name-value arguments are not supported.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History
Introduced in R2017a
R2022a: Return table or timetable containing logical output
For table or timetable input data, return a tabular output TF instead of a logical array by setting the OutputFormat name-value argument to "tabular".
R2021b: Specify sample points as table variable
For table input data, specify the sample points as a table variable using the SamplePoints name-value argument.
See Also
Functions
rmoutliers
ischange
islocalmax
islocalmin
filloutliers
ismissing
Live Editor Tasks
Apps