Faster array comparison with big datasets

12 views (last 30 days)
KBK007
KBK007 on 5 Mar 2015
Edited: Stephen23 on 2 Apr 2015
I am trying to compare two large arrays to each other and do some extra operations. So far it works but it is very slow. Coming from a little C++ background, I can use pointers to their location instead of taking the value of each. I have found how to do so in matlab using shrlib but I am not sure if it works for doubles.
What I have available:
Two textfiles each in the following format: name in col 1, value1 in col2 and value2 in col 3
filename value1 value2
---------------------------------------
foo0001 100 200
foo0001 140 200
foo0001 135 155
foo0002 124 145
foo0002 135 155
foo0003 1419 1943
. . .
. . .
I would want to:
  1. compare the values of value1 and value2 from the first file to those of the second file
  2. be tolerant to errors with respect to elements from value1 and value2
  3. the file comparison to run fast ( my current method is very slow)
So here is what I have done so far:
file1 = readtable('file1.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
file2 = readtable('file2.txt', 'Delimiter',' ','ReadVariableNames',false);
fclose(fileID);
n = 1000;
detectedCorn = zeros(n,2); %initialize large nx2 vectors to save value1 and value2
referenceCorn = zeros(n,2);
null = ['foo000',int2str(ip)]; % initialize filename as "dynamic" string
for ip = 1:100
iv = 1; % initialize index variables to use later on
iv2 = 1; %
for ii = 1:numel(file1.Var1) %%all Var1 -> name in first col in file1 table
if strcmp(file1.Var1(ii), null) == 1;
var = [file1.Var2(ii),file1.Var3(ii)];
detectedCorn(iv,:) = var;
var = zeros(1,2);
iv = iv + 1;
end
%REPEAT SAME STEP AS ABOVE FOR file2( I will skip that so the code doesnt get too crowded)
for kk = 1:1000
for rr = 1:1000
var3 = [0,0];
if all(ismember(detectedCorn(kk,:),var3)) == 0 &&...
all(ismember(referenceCorn(kk,:),var3)) == 0 %%do not count the empty spaces in the array
if all(ismember(detectedCorn(kk,:),referenceCorn(rr,:))) == 1 %%iterate through the two vectors and compare
% do something
else
% do something else
end
end
end
end
end
end
---------------------------------------------------------------------------
Clarification:
For example I have these files:
file 1: Reference
imagename.png value1 value2
000000004.png 100 200
000000004.png 120 130
000000004.png 150 1500
000000005.png 1214 501
file 2: Detected
imagename.png value1 value2
000000004.png 100 200
000000004.png 204 300
000000004.png 300 400
000000004.png 129 140
000000004.png 152 1499
000000005.png 1200 500
What I mean by compare, I want to do the following: for example the first row in the Detected file (file2) I have row1. First thing I have to make sure that the filenames in the two files are the same. If that so then I can compare the data in x and y. if difference of value1 from Detected file (in this example 100) and the value1 from Reference file is equal to 0,1,2,-1,-2 then, check if difference of value2 of the same row in both files equal to 0,1,2,-1,-2, increment the variable TP (Positive match) with 1. And so on for all files... By that logic the for the above data TP = 2, FP = data found in Detected but not in reference and FN is data found in reference but not in detected.
Notice that file detected has way more entries than the first file and does not necessarily have the same filenames
  9 Comments
KBK007
KBK007 on 5 Mar 2015
stephen I know this is gona be a dumb question to ask, but I guess it is something you guys have encountered before, isn't a table "vectorization" ? it is a nx3 dimensional array is it not ? or should I put it in a matrix structure and then pull some operations on it? I am going to check out the definition of a table and try to put the data in a nx2 (for each image) matrix. Is that what you mean ?
Oleg Komarov
Oleg Komarov on 2 Apr 2015
Have you tried running the profiler to see where the bottlenecks are?

Sign in to comment.

Answers (3)

Stephen23
Stephen23 on 5 Mar 2015
Edited: Stephen23 on 5 Mar 2015
My version of MATLAB is a little long in the tooth and does not support tables, so I used simple numeric arrays instead, but this code should get you started on your way. One big hint for using MATLAB: learn to navigate and use the documentation , it is actually very helpful.
fid = fopen('tempDet.txt','rt');
DetC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
fid = fopen('tempRef.txt','rt');
RefC = textscan(fid,'%s%f%f','HeaderLines',2,'CollectOutput',true);
fclose(fid);
[DetV,~,DetX] = unique(DetC{1});
[RefV,~,RefX] = unique(RefC{1});
assert(all(strcmp(DetV,RefV)),'Some filenames are missing in one of the files!')
for k = reshape(unique(DetX),1,[])
DetK = DetX==k;
RefK = RefX==k;
DetN = sum(DetK)
RefN = sum(RefK)
intX = intersect(DetC{2}(DetK,:),RefC{2}(RefK,:),'rows')
intN = size(intX,1)
end
It opens each file (attached), checks that the filenames correspond, then for each filename finds the ones with matching numeric values (both values in the row must be the same). The sums and the intN tell you how values there are for each file, and many match correctly, respectively. The (rather simple) data files that I used are attached here:
  1 Comment
KBK007
KBK007 on 6 Mar 2015
Thank you stephen for your time and contribution. My only concern is this program seems to require that for example the foo0001 from the first file MUST have a corresponding foo0001 in the other file. However since I am testing an algorithm against another, I may detect some false positives and may have data for foo0001 despite the fact that in the reference file there is no foo. I will look at it myself but thank you for the example of vectorization, now i sorta get what you were aiming for.

Sign in to comment.


Peter Perkins
Peter Perkins on 5 Mar 2015
KBK007, as Stephen said, your code is unvectorized, and that's why it is slow. It is also complicated enough that its purpose kind of hard to figure out. It seems like you're doing something along the lines of finding data in one file that are also present in another file. That may be related to a join operation, or some kind of intersection, or even some kind of grouped calculation. There are functions specifically for those operations, but it's really hard to say what you should be using without a clear description of what data you have and what you want do with it. In words, not in code. You should be able to make a very small pair of example data files, and show exactly what results you want from that.

Peter Perkins
Peter Perkins on 9 Mar 2015
Given Stephen's interpretation of your code, and some similar data, here is a vectorized version that prints out
  • filenames that occur in only one of the two tables
  • the number of rows for each filename in each table
  • rows with filename/value1 in common between the two tables
Not sure if this does what you're looking for, but at least it will give you an idea of what is meant by "vectorization". Hope it helps.
>> DetC = table( ...
{'foo0001'; 'foo0001'; 'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'; 'foo0004'}, ...
[101; 141; 133; 124; 135; 1419; 1420], ...
[199; 200; 154; 145; 152; 1933; 1934], ...
'VariableNames',{'filename' 'value1' 'value2'});
>> RefC = table( ...
{'foo0001'; 'foo0002'; 'foo0002'; 'foo0003'}, ...
[141; 124; 123; 1419], ...
[200; 144; 130; 1933], ...
'VariableNames',{'filename' 'value1' 'value2'});
>> % Find filenames that occur in only one of the two tables
>> setxor(DetC.filename,RefC.filename)
ans =
'foo0004'
>> % Find the number of rows for each filename in each table
>> [~,~,uindices] = unique(DetC.filename);
>> accumarray(uindices,1)
ans =
3
2
1
1
>> [~,~,uindices] = unique(RefC.filename);
>> accumarray(uindices,1)
ans =
1
2
1
>> % Find the rows with filename/value1 in common between the two tables
>> intersect(DetC(:,1:2),RefC(:,1:2))
ans =
filename value1
_________ ______
'foo0001' 141
'foo0002' 124
'foo0003' 1419
  3 Comments
KBK007
KBK007 on 10 Mar 2015
Edited: KBK007 on 10 Mar 2015
I don't think vectorization can be applied here. I don't have the same data size and when comparing matrices or the difference between the values, each value has to be present in both. This means the dimensions should agree otherwise I think it is loopy loop for me. Also I really want to get the difference of each value against all others in the other file, it is compulsory to use a loop
Stephen23
Stephen23 on 2 Apr 2015
Edited: Stephen23 on 2 Apr 2015
You can still compare data in differently-sized groups by using bsxfun. It is a very fast and efficient tool that can be used in vectorized code, and allows you to compare differently sized vectors very quickly. You can do something like this:
>> X = [1,2,4,6,8];
>> Y = [1,3,4,5,7,8,9];
>> bsxfun(@eq,X(:),Y)
ans =
1 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 1 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 1 0
You can then use any and all or other logical operations to select which values you want to look at. And of course you can choose or define the function use in bsxfun (I used the inbuilt eq).
If you decide to go with using a loop, then you really need to read this:

Sign in to comment.

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!