Compare two lists of strings line by line for a match and summarize results.

I have a script that works with the test data files.
It breaks down 2 input files of different sizes into strings, then compares them line by line for a match.
How do I add the results from each iteration to a summary vector?
How do I identify the lines that match?
In the test data it is str_chk(8) and str_readin(8) which would be line #162 in the results.
For the test data I have been copy/pasting the command window result into an excel file.
%AutoBOM
clear
%input search list
cd 'C:\Desktop\AutoBOM'
[n,t,r]=xlsread('readin.csv');
%convert it to a string
str_readin=string(r);
str_readin=lower(str_readin);
%input list of known words to compare to
%input library
[a,b,l]=xlsread('check_list.csv');
%convert it to a string
str_chk=string(l);
str_chk=lower(str_chk);
%Compare line by line for a match
for i=1:numel(str_chk)
    for j=1:numel(str_readin)
        if str_chk(i) == str_readin(j)
            disp('Match')
            % How do I add the result to a new vector?
            % How do I identify the lines that match?
        else
            disp('No')
            %add displayed word to the same vector
        end
    end
end

Accepted Answer

dpb on 29 Mar 2021
Edited: dpb on 1 Apr 2021
You don't need to loop explicitly; MATLAB has built-in functions that do that for you.
readin=lower(string(textread('readin.csv','%s','delimiter','\n')));
check=lower(string(textread('check_list.csv','%s','delimiter','\n')));
[ia,locb]=ismember(readin,check);
gives you
>> find(ia)
ans =
8
>> readin(ia)
ans =
"joker"
>> check(locb~=0)
ans =
"joker"
>>
So, depending on what you need to return: ia marks the elements of the readin string array that were found in the check string array, and locb gives, for each of those elements, the location of the first match in the check array (if there is more than one).
See the doc for ismember for the full details on input/output arguments.
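The two outputs of ismember can also be turned into the per-line summary the question asks about. The following is only a minimal sketch: the data is made up to stand in for the two CSV files, and the names tf, loc, summary and T are illustrative assumptions rather than part of the code above.

```matlab
% Made-up stand-ins for the contents of readin.csv and check_list.csv
readin = lower(["Ace"; "King"; "Joker"; "Ten"]);
check  = lower(["Queen"; "Joker"; "Ace"]);

% tf(k) is true when readin(k) occurs in check;
% loc(k) is the lowest matching index in check, or 0 if there is none
[tf, loc] = ismember(readin, check);

% One summary entry per line of readin
summary = repmat("No", numel(readin), 1);
summary(tf) = "Match";

% Table listing each line, its result, and where it was found in check
T = table((1:numel(readin)).', readin, summary, loc, ...
    'VariableNames', {'Line', 'Text', 'Result', 'LineInCheck'});
disp(T)
```

If a file is wanted instead of copy/pasting from the command window, something like writetable(T,'summary.csv') should do it; the file name here is just a placeholder.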
ADDENDUM:
>> readin=[readin;"2"]; % add another element that is duplicated in check
[ia,locb]=ismember(readin,check);
found=readin(ia);
% illustrate what we get...
>> found
found =
2×1 string array
"joker"
"2"
>>
nFound=numel(found);
locsInCheck=cell(nFound,1); % preallocate cell array for locations
for i=1:nFound
locsInCheck(i)={find(found(i)==check)};
end
% show what we gots...
>> locsInCheck
locsInCheck =
1×2 cell array
{[8.00]} {6×1 double}
>>
You now have each string that was found in the check array, along with every location in check where it occurs.
Only the second loop is still needed, to find all the locations for each string; the first loop is inside the much more efficient built-in ismember function, which does the hard work.
If you wanted, you could replace the explicit for...end loop above with an arrayfun construct (cellfun accepts only cell arrays, and found is a string array):
locsInCheck=arrayfun(@(s) find(s==check),found,'UniformOutput',false);
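As a self-contained illustration of collecting every location per found string, here is a small sketch with made-up data. It uses arrayfun because that iterates over the elements of a string array directly; the extra variable nPerString is an assumption added only to show one use of the result.

```matlab
% Made-up data: "2" occurs three times in check, "joker" once
check = ["2"; "ace"; "2"; "joker"; "2"];
found = ["joker"; "2"];

% For each found string, list every index in check that holds it
locsInCheck = arrayfun(@(s) find(s == check), found, 'UniformOutput', false);

% Number of occurrences of each found string in check
nPerString = cellfun(@numel, locsInCheck);
```

Each cell of locsInCheck lines up with the same row of found, so found(k) and locsInCheck{k} can be reported together.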
  9 Comments
dpb on 7 Apr 2021
Edited: dpb on 7 Apr 2021
No, that isn't actually the goal. The "line by line" came from the OP's initial implementation, which used doubly nested for...end loops to compare each line of the first array with every line of the second. With ismember, that loop runs internally in compiled code, which locates the elements of the first array that have matches in the second much more efficiently.
However, the goal was to find all occurrences of each line of one array in the other, where there can be duplicates, and to return the duplicated elements as well. It was never made clear whether both arrays may contain duplicates or whether one of them is unique. In the latter case one can get by with no loops. In the former, one still has to loop over the found elements of one array to return the duplicates in the other, since ismember returns only the first match of possibly many as its second output. I've lobbied for years for an alternate optional cell output that would return all matches directly, but so far no joy on that front.
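For the case where one array is unique, one possible loop-free approach is sketched below. It is an assumption on my part, not something spelled out in the thread: it calls ismember with the arguments swapped and groups the results with accumarray. The data is made up, and readin is assumed to contain no duplicates.

```matlab
% Made-up data: readin has no duplicates, check does
readin = ["joker"; "2"; "king"];
check  = ["2"; "ace"; "2"; "joker"; "2"];

% For each line of check, find which line of readin (if any) it matches
[tf, loc] = ismember(check, readin);

% Group the matching line numbers of check by the readin line they match.
% sort() guards against accumarray passing values in a different order.
allLocs = accumarray(loc(tf), find(tf), [numel(readin) 1], @(v) {sort(v)});
```

Here allLocs{k} holds every line of check equal to readin(k), and is empty when readin(k) does not occur in check at all.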
Adam Danz on 7 Apr 2021
I can see how the additional ismember output would be helpful.
This doesn't really avoid loops, but it works with string arrays that aren't too long:
A = ["A" "B" "C" "B"];
B = ["B" "B" "C" "B" "A"];
[row, col] = find(A(:)==B(:).');
arrayfun(@(i){B(col(row==i))}, 1:numel(A))
ans = 1×4 cell array
{["A"]} {["B" "B" "B"]} {["C"]} {["B" "B" "B"]}
% Or to return Bi,
% arrayfun(@(i){col(row==i)}, 1:numel(A))

Release

R2018b
