# How to replace values above a given percentile by nans

6 views (last 30 days)
Blue on 14 Aug 2019
Commented: Blue on 16 Aug 2019
Hi,
I am trying to replace values above the 99th percentile in a table by nans by using Star Strider excellent little function (https://www.mathworks.com/matlabcentral/answers/127944-how-to-pick-the-j-th-percentile-of-a-vector). An example of my code looks like so:
% Table with nan
t = array2table(vertcat(horzcat([1:1000]', [1001:2000]'), NaN(10, 2)));
% Replace values above 99th percentile by nan
pct = @(v,p) interp1(linspace(0.5/length(v), 1-0.5/length(v), length(v))', sort(v), p*0.01, 'spline');
t.Var1(t.Var1 > pct(t.Var1, 99)) = nan;
t.Var2(t.Var2 > pct(t.Var2, 99)) = nan;
The problem of course is that there are nan values scattered across multiple variables in the table and I therefore receive the following error message: Warning: Columns of data containing NaN values have been ignored during interpolation.
Does anyone has an idea as to how I could replace values above the 99th percentile in a table by nans in a table where there are already multiple nans ? Please note that I do not have the Statistic toolbox.

Adam Danz on 15 Aug 2019
Edited: Adam Danz on 15 Aug 2019
Here is an alternative function that ignores NaN values. It's not a nice one-liner like Star Striders but it does produce the same output as the prctile function in the stats toolbox.
function pctl = percentile(v,p)
% v is a vector of data; example v = [8 9 nan 13 15 11 nan 3 5 7];
% p is a percentile; example p = 75;
% pctl is the exact p_th percentile of v.
vsort = reshape(sort(v(~isnan(v))),1,[]); %row vector
pd = p/100 * numel(vsort);
idx = floor(pd + 0.5) + [0,1];
md = pd - idx(1);
idx(idx<1) = 1;
idx(idx>numel(vsort)) = numel(vsort);
pctl = sum(vsort(idx) .* (0.5+[-md,md]));
end % <-- only needed if this function is within a script
Comparison
v = [8 9 nan 13 15 11 nan 3 5 7];
p = 82;
percentile(v,p) % = 13.120000 this function
prctile(v,p) % = 13.120000 stat's toolbox
pct(v(~isnan(v)),p) % = 13.115995 the one-liner approximation
To apply it to your data,
t = array2table(vertcat(horzcat((1:1000)', (1001:2000)'), NaN(10, 2)));
t.Var1(t.Var1 > percentile(t.Var1, 99)) = nan;
t.Var2(t.Var2 > percentile(t.Var2, 99)) = nan;

#### 1 Comment

Blue on 16 Aug 2019

per isakson on 15 Aug 2019
Edited: per isakson on 15 Aug 2019
Ignore the NaNs explicitely when calculating the percentile. Try
%%
% Table with nan
t = array2table(vertcat(horzcat([1:1000]', [1001:2000]'), NaN(10, 2)));
% Replace values above 99th percentile by nan
t.Var1(t.Var1 > pct(t.Var1, 99)) = nan;
t.Var2(t.Var2 > pct(t.Var2, 99)) = nan;
function z = pct(v,p)
v( isnan(v) ) = [];
z = interp1(linspace(0.5/length(v), 1-0.5/length(v), length(v))' ...
, sort(v), p*0.01, 'spline');
end

#### 1 Comment

Adam Danz on 15 Aug 2019
If you're going to use the one-liner appoximation, you can keep the function handle and explicitly ignore the NaNs from within the call to the function.
t = array2table(vertcat(horzcat((1:1000)', (1001:2000)'), NaN(10, 2)));
pct = @(v,p) interp1(linspace(0.5/length(v), 1-0.5/length(v), length(v))', sort(v), p*0.01, 'spline');
t.Var1(t.Var1 > pct(t.Var1(~isnan(t.Var1)), 99)) = nan;
t.Var2(t.Var2 > pct(t.Var2(~isnan(t.Var2)), 99)) = nan;
% {_____here____}