Fastest multiple binary search in the same sorted large vector
3 views (last 30 days)
Show older comments
Hi All, I have a challenging problem to find the fastest solution for as it is a big bottleneck in my code right now. Vector v [Nx1] contains the sorted rates of v(i)=c(i)/n(i) where c and n are positive integers all of which are accessible. Now I have another vector of candidate rates u that are created by the averages of various pairs of the original v rates, for example u(i,j)=(c(i)+c(j))/(n(i)+n(j)). All I want is to quickly find for all candidate rates in a vector u where it would appear in the original vector v - i.e. what is the first value v that is greater or equal than my candidate rate u(i,j). Since the candidate rates are generated by averageing pairs of rates I know that the searched position would be somwhere between i and j so the fastest solution I have so far is to run the binary search on the section of v(i:j) for the first value greater or equal than u(i,j) and then add the offset i-1. I am not sure if it is possible to exploit efficiently the fact that v is formed by rates of integer values and u is generated by averaging of these rates. If this is not possible to exploit efficiently I guess my question can be reduced how to what is the fastest implementation of multiple binary search within the same sorted (large) vector. For instance is it better to also first sort u and then exploit the results of binary search of the previous smaller u elements to speed up subsequent binary searches of larger values of u? Cheers
2 Comments
Bruno Luong
on 2 Jan 2021
Edited: Bruno Luong
on 2 Jan 2021
All the options you cited might be good.
You can use stock function such as histc and histcounts. They are fast.
You might try some of my FEX using MEX such as
by sorting u and try to find the location in v.
You might adapt the later to start the search in (i,j) if you are fluent in C.
It would be helpful to know how big is your data, how fast you want it to perform and how do you stand now in term of speed.
Answers (3)
Bruno Luong
on 2 Jan 2021
Edited: Bruno Luong
on 2 Jan 2021
v=sort(rand(1e6,1)); u=rand(1e6,1);
ntests = 10;
t=zeros(1,ntests);
for k=1:ntests
tic;
[~,~,loc]=histcounts(u,v); % loc is the last index such that v(loc(i))<=u(i)
t(k)=toc;
end
fprintf('mean time = %g sec\n', mean(t)) % mean time = 0.148221 sec
2 Comments
Bruno Luong
on 2 Jan 2021
Like this.
[~,~,loc] = histcounts(-u,-flip(v));
loc = (length(v)+1)-loc; % loc is the first index such that v(loc(i))>=u(i)
But my other answer provide even faseter solution.
Bruno Luong
on 2 Jan 2021
Use FEX file here
v=sort(rand(1e6,1));
u=rand(1e6,1);
% Make sure v covers u on the right side
v(end)=max(v(end),max(u));
ntests = 10;
t=zeros(1,ntests);
for k=1:ntests
tic
[w,is] = sort(u);
% Fex file
% https://www.mathworks.com/matlabcentral/fileexchange/28930-merge-sorted-arrays?s_tid=srchtitle
[~,j] = mergesa(v,w);
loc = zeros(size(w));
l=length(v);
for n=length(j):-1:1
jn=j(n);
if jn<0
loc(-jn)=l;
else
l=jn;
end
end
loc(is) = loc;
t(k) = toc;
end
fprintf('mean time = %g sec\n', mean(t)) % mean time = 0.0816771 sec
2 Comments
Bruno Luong
on 7 Jan 2021
Edited: Bruno Luong
on 7 Jan 2021
I did not find the similar time ratio as you, especially case 1e7 and 1e7, where discretize is only 3 time faster (you show 15 times faster)
Here is my results (R2020b, Windows laptop) with the attached test script (find_idx is mex binary search with the link I posted above)
--------------------------------------
Number of bins (nv) = 10^(7)
Number of query points (nu) = 10^(7)
Method "discretize" : mean time = 1.95056 sec
Method "find_idx" : mean time = 0.996632 sec
Method "bin_find" : mean time = 6.77919 sec
--------------------------------------
Number of bins (nv) = 10^(8)
Number of query points (nu) = 10^(2)
Method "discretize" : mean time = 0.150994 sec
Method "find_idx" : mean time = 0.00023296 sec
Method "bin_find" : mean time = 0.00030678 sec
You might implement a top level manager function that selects the fast algorithm depending on the size nu or the ratio nu/nv or such.
EDIT: I just improve my FEX submission FIND_IDX using multi threading MEX.
It now becomes the fatest in two test cases.
See Also
Categories
Find more on Matrix Indexing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!