How can I detect and remove outliers from a large dataset?

I am trying to process a large dataset (n = 5000000) and am struggling to write code that detects and removes all the outliers in it. I tried the modified Thompson tau method, but it didn't work, and I am now trying the modified z-score method but still can't make headway with the MATLAB code.
Attached is a plot of the signal, showing the peaks and dips, for better understanding. I would also like to fill the deleted outlier points by interpolation and would appreciate a suggestion on how to do that.
I would also appreciate suggestions on other methods for removing the peaks and dips, and code for them if possible.
Thank you.
  2 Comments
Star Strider on 12 Mar 2014
Do you have any trends in your data that you could model, perhaps with nlinfit or other regression routines? I have no idea what you are doing or what your data are, but detecting trends and other patterns first could make your task easier.
Arinze on 15 Mar 2014
Star Strider, I attached a picture of the plot for better understanding.


Answers (4)

Shahab B on 30 Sep 2016
How can I use it for simple data such as: main=[0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];
Note that the outlier is 347.666506871168.
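As a rough illustration, the median-absolute-deviation test discussed in this thread (deviations measured from the mean, median taken over those deviations) can be run directly on this five-element vector. The sensitivity factor of 6 is an adjustable assumption, not a fixed rule, and the name madAboutMean is introduced here only for the sketch.

```matlab
% Data from the question above.
main = [0 347.666506871168 97.948966303887 98.8584847142621 96.4002074686564];

% Adjustable cutoff (assumption): how many MADs away counts as an outlier.
sensitivityFactor = 6;

% Absolute deviation of every sample from the mean of the data.
absoluteDeviation = abs(main - mean(main));

% Median of those absolute deviations (MAD about the mean).
madAboutMean = median(absoluteDeviation);

% Logical mask of samples whose deviation exceeds the threshold.
outlierIndexes = absoluteDeviation > sensitivityFactor * madAboutMean;

outliers = main(outlierIndexes);      % flagged values
nonOutliers = main(~outlierIndexes);  % retained values
```

With this factor only the second element (347.666506871168) is flagged; the leading 0 is retained. A smaller factor, or deviations measured from the median instead of the mean, would give a stricter test.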
  4 Comments
Image Analyst on 21 Nov 2016
There are several definitions of MAD. My code above does definition 1.2.1 as listed on this page https://en.wikipedia.org/wiki/Average_absolute_deviation which gives 4 definitions using all combinations of mean and median. You're welcome to use whichever of those definitions best meets your needs.
Nivodi on 14 Aug 2018
Image Analyst, how can I apply this part of your code to several columns?
% Compute the mean of the data (the central point for the deviations)
meanValue = mean(vector)
% Compute the absolute deviations from the mean. It will be a vector.
absoluteDeviation = abs(vector - meanValue)
% Compute the median of the absolute deviations.
% Named madValue so it does not shadow the Statistics Toolbox function mad().
madValue = median(absoluteDeviation)
% Find outliers. They're outliers if the absolute deviation
% is more than some factor times the MAD value.
sensitivityFactor = 6 % Whatever you want.
thresholdValue = sensitivityFactor * madValue;
outlierIndexes = absoluteDeviation > thresholdValue
% Extract outlier values:
outliers = vector(outlierIndexes)
% Extract non-outlier values:
nonOutliers = vector(~outlierIndexes)
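One possible way to run the same test on several columns at once is to compute the mean and the MAD along dimension 1, so that each column gets its own threshold. The sketch below is only an illustration: the matrix data is made up, the expression data - columnMeans relies on implicit expansion (R2016b or later; bsxfun would be needed in older releases), and outliers are replaced by NaN rather than deleted because the columns may contain different numbers of outliers.

```matlab
% Hypothetical example matrix: each column is one variable, each row one sample.
data = [10 200  5;
        11 210  6;
         9 190  5;
        10 205 50;
        95 198  6;
        11 202  5;
        10 201  6;
         9 199  5;
        10 203  6;
        11 197  5];

sensitivityFactor = 6;   % adjustable cutoff, same role as in the vector version

% Column-wise mean (1-by-3 row vector).
columnMeans = mean(data, 1);

% Absolute deviation of every element from its own column mean.
% Requires implicit expansion (R2016b+).
absoluteDeviation = abs(data - columnMeans);

% Column-wise median of the absolute deviations (one MAD per column).
columnMad = median(absoluteDeviation, 1);

% Logical matrix, true where an element is an outlier within its column.
outlierMask = absoluteDeviation > sensitivityFactor * columnMad;

% Keep the matrix rectangular by marking outliers as NaN.
cleaned = data;
cleaned(outlierMask) = NaN;
```

In this made-up matrix the values 95 (row 5, column 1) and 50 (row 4, column 3) are flagged and the second column is left untouched.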



Image Analyst on 12 Mar 2014
That's not large. It's just a fraction of the size of a typical digital image. You can use "deleteoutliers" by Brett Shoelson of MathWorks, available on the File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/3961-deleteoutliers Or you could try the Median Absolute Deviation (a popular statistical method for detecting outliers), as demonstrated on an image in the file I attached.
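For a 1-D signal, a minimal sketch of MAD-based detection followed by interpolation might look like the code below. It is not the attached image demo: the test signal is synthetic, the modified z-score constant 0.6745 and the cutoff 3.5 are the commonly quoted values rather than something fixed by this thread, and the median is taken over the whole signal, which assumes the signal has no strong trend.

```matlab
% Synthetic signal with two injected spikes (illustrative data only).
t = (1:200)';
signal = sin(2*pi*t/50);
signal([40 120]) = [8 -7];

% Modified z-score: deviation from the median, scaled by the MAD.
med = median(signal);
madValue = median(abs(signal - med));
modifiedZ = 0.6745 * (signal - med) / madValue;

% Flag samples whose modified z-score exceeds the cutoff (assumed 3.5).
isOutlier = abs(modifiedZ) > 3.5;

% Fill the flagged samples by linear interpolation over the good samples.
cleaned = signal;
cleaned(isOutlier) = interp1(t(~isOutlier), signal(~isOutlier), ...
                             t(isOutlier), 'linear', 'extrap');
```

Everything here is vectorized, so a vector of a few million samples should not be a problem. If the signal drifts, the same test could be applied in a moving window instead; releases R2017a and later also provide isoutlier and filloutliers, which cover similar ground.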
  7 Comments
Arinze on 15 Mar 2014
Edited: Arinze on 15 Mar 2014
Here is a new plot of the signal. My MATLAB doesn't recognise the 'deleteoutliers' command. Any idea why?



Tim leonard on 12 Mar 2014
Trimming your values based on percentiles is quick and powerful:
vector = randi(100,100,1); % example data: 100 random integers between 1 and 100
percentiles = prctile(vector,[5 95]); % 5th and 95th percentiles
outlierIndex = vector < percentiles(1) | vector > percentiles(2);
% Remove outlier values
vector(outlierIndex) = [];
  1 Comment
Image Analyst on 12 Mar 2014
But something at the 1% or 99% or 100% percentile is not necessarily an outlier so you could be getting rid of good data. It's quick but I wouldn't call it powerful. I'd call it risky, unless you know for a fact that you have a certain specific amount of noise present.
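To make the risk concrete, here is a small sketch on deliberately clean data (the integers 1 to 100, chosen only for illustration). Percentile trimming discards a fixed share of the samples regardless of whether anything is wrong with them, whereas a deviation-based test flags nothing. The MAD test shown uses the median as the centre and a factor of 6, both of which are assumptions for this example; prctile requires the Statistics and Machine Learning Toolbox.

```matlab
% Clean, evenly spaced data with no outliers at all (illustrative only).
vector = (1:100)';

% Percentile trimming: everything below the 5th or above the 95th percentile.
limits = prctile(vector, [5 95]);
trimmed = vector < limits(1) | vector > limits(2);
numTrimmed = nnz(trimmed);   % about 10% of the samples, by construction

% Deviation-based test for comparison (median-centred MAD, factor assumed).
med = median(vector);
madValue = median(abs(vector - med));
flagged = abs(vector - med) > 6 * madValue;
numFlagged = nnz(flagged);   % nothing stands out in this data
```

The trimmed count depends only on the chosen percentiles, not on the data, which is why the method is safe only when the fraction of bad samples is known in advance.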



Amir H. Souri on 26 Jun 2017
Hi, I may be late, but I just want to point out that the definition of an outlier is entirely subjective. To find outliers, you need to estimate the probability distribution of your data by fitting a distribution (for example, a Gaussian) and checking whether the fit is statistically acceptable (you may use the Kolmogorov–Smirnov test or a bootstrap method). You can then identify the outliers by defining a confidence interval. For example, you can say that any data within the 95% confidence interval are acceptable and everything else can be ignored as an outlier. As I mentioned, there is no absolute answer; it depends on the nature of the data and on how strict you want to be with the confidence interval.
Good luck!
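A minimal sketch of the interval step, assuming a Gaussian model is appropriate: estimate the mean and standard deviation, then flag anything outside the central 95% band (mean plus or minus 1.96 standard deviations). The sample values are made up, and the goodness-of-fit check is left out; with the Statistics and Machine Learning Toolbox, kstest on the standardized data is one option for that step.

```matlab
% Hypothetical measurements with one suspicious value at the end.
x = [9.8 10.1 10.0 9.9 10.2 10.0 9.9 10.1 10.0 25.0];

% Fit a Gaussian by estimating its two parameters from the data.
mu = mean(x);
sigma = std(x);

% Two-sided 95% band of a normal distribution: mu +/- 1.96*sigma.
z95 = 1.96;
lowerLimit = mu - z95 * sigma;
upperLimit = mu + z95 * sigma;

% Values outside the band are treated as outliers under this model.
isOutlier = x < lowerLimit | x > upperLimit;
accepted = x(~isOutlier);
```

Because the outliers themselves inflate mu and sigma, this test can miss them when they are numerous or extreme; estimating the parameters robustly (for example from the median and MAD) is a common workaround.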
