Clear Filters
Clear Filters

How can I rapidly filter a sorted vector for elements separated by a minimum value, processing one-way from the first value?

2 views (last 30 days)
Using MATLAB R2012b, I have a (large) sorted vector of elements. Starting from the initial value, I need to discard all elements that are within some particular range of any previous undiscarded element. Simply looking for any set with the given minimum separation will not give the unique solution I require.
(Slow, but obvious) MWE:
function vec2 = testfun(vec,threshold)
idx = 0.1;
vec2 = vec;
while not(isempty(idx));
testdiff = diff(vec2);
idx = find(testdiff<threshold,1);
vec2(idx+1) = [];
end
end
When I call this on
vec = [1,13,27,30,33,39,53,54,100];
threshold = 25;
I should only retain:
vec2 == [1,27,53,100]
but calling
testfun(vec,10)
should give me
vec2 = [1, 13, 27, 39, 53, 100]
When I call it with a script to randomly generate larger vectors:
numval = 1e5;
threshold = 25;
vec = sort(rand(numval,1).*numval*1e2);
vec2 = testfun(vec,threshold);
the profiler indicates the following breakdown within the function:
vec2(idx+1) = []; 66%
dx = find(vecdiff<thresho... 20%
vecdiff = diff(vec2); 12%
Clearly, the re-indexing and allocation of vec2 after deleting elements is slow. Yet, if I don't take those elements out (e.g., I could set them = 0 instead of []), I have to find a way to ignore negative and zero elements in the diff output, but continue to search for results below the threshold value from the previous retained element. The total runtime in the function appears to behave quadratically, although the idx = find() line begins to take up more of the total time as the length of the vector increases (517s, 46/40/13% for numval = 4e5).
  2 Comments
Jan
Jan on 17 Feb 2017
@Stephen: You are right. Code formatting, test data, profiler used already, explained what has been tried already, reaction to asnwers beyond "does not work". +1

Sign in to comment.

Accepted Answer

Jan
Jan on 16 Feb 2017
Edited: Jan on 17 Feb 2017
function vec = testfun2(vec, threshold)
idx = 1;
n = length(vec);
keep = false(1, n);
while ~isempty(idx)
keep(idx) = true;
iThresh = vec(idx) + threshold;
idx = find(vec > iThresh, 1); % [EDITED]
end
vec = vec(keep);
[EDITED, 2017-02-17 08:24 UTC]:
A version which runs in linear time:
function vec = testfun3(vec, threshold)
n = length(vec);
keep = false(1, n);
lim = vec(1) + threshold;
keep(1) = true;
for k = 2:n
if vec(k) > lim
keep(k) = true;
lim = vec(k) + threshold;
end
end
vec = vec(keep);
Timings under R2016b/64:
numval = 1e5;
threshold = 25;
vec = sort(rand(numval,1).*numval*1e2);
tic; r1 = testfun(vec, threshold); toc % Original
tic; r2 = testfun2(vec, threshold); toc % FIND(vec > iThresh)
tic; r3 = testfun3(vec, threshold); toc % Dull loop
isequal(r1, r2, r3)
Elapsed time is 44.117896 seconds.
Elapsed time is 10.090938 seconds.
Elapsed time is 0.003627 seconds.
1 % Results are equal
Whoooops.
Could you please cross-check this? A speedup of factor 12000 usually means, that something important has been overseen.
  5 Comments
Jan
Jan on 17 Feb 2017
The profiler disables the JIT acceleration, because the JIT can Change the order of commands to optimize the code and this would confuse the profiling. This is very sad, because code which profits from the JIT cannot be profiled accurately such that further optimizations require primitive tic/toc's instead.
Jan
Jan on 20 Feb 2017
Edited: Jan on 20 Feb 2017
@Daniel: For large data, the line vec = vec(keep) can be accelerated using FEX: CopyMask. When the complete function impelemented in C, the runtime for a 1e8 input decreased from 3.3 to 1.1 sec on my old Core2Duo. Do you have a C-compiler?

Sign in to comment.

More Answers (1)

Walter Roberson
Walter Roberson on 17 Feb 2017
You can run in linear time.
function vec2 = testfun(vec, threshold)
vec2 = vec; %duplicate vector
target = vec(1) + threshold;
used = 1;
for K = 2 : length(vec)
if vec(K) < threshold; next; end
used = used + 1;
vec2(used) = vec(K);
target = vec(K) + threshold;
end
vec2(used+1:end) = [];
Here I interpreted "within some particular range" as being exclusive -- that, for example, 5.0 would not be "within" 1.0 of 4.0. If that should be inclusive then change from < to <=
vec2 starts as a duplicate of vec1 so that it has the same data type and size -- and not having to copy over vec(1) is a bonus.
Items are copied into positions in vec2 as needed. Only a portion of vec2 is logically "in use" at any one time. At the end, the unused positions are discarded. This arrangement requires only a single memory allocation at the beginning, and then throwing away the trailing bit of it at the end (I'm not sure if that requires copying it.)
  1 Comment
Daniel
Daniel on 17 Feb 2017
Edited: Daniel on 17 Feb 2017
Fantastic, Walter! This works quite well, with one fix:
if vec(K) < target; continue; end
Thank you very much. I'm not worried about the inclusivity, since the chance of it occurring is minuscule - but it was good of you to point out that ambiguity. I can now get through a vector with 1e8 elements in slightly less time than 5e5 took before.
[EDIT] That was using the profiler to time the run. Using tic/toc, I get through in the same time as 1e5 used to take.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!