Compare two lists of strings line by line for a match and summarize results.

Question

Jon Thornburg on 29 Mar 2021

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/786949-compare-two-list-of-strings-line-by-line-for-a-match-and-summarize-results

Edited: dpb on 7 Apr 2021

Accepted Answer: dpb

I have a script that works with the test data files.

It breaks down 2 input files of different sizes into strings, then compares them line by line for a match.

How do I add the results from each iteration to a summary vector?

How do I identify the lines that match?

In the test data it is str_chk(8) and str_readin(8), which would be line #162 in the results.

For the test data I have been copy/pasting the command window result into an Excel file.

%AutoBOM
clear
% input search list
cd 'C:\Desktop\AutoBOM'
[n,t,r]=xlsread('readin.csv');
% convert it to a string
str_readin=string(r);
str_readin=lower(str_readin);

% input library: list of known words to compare to
[a,b,l]=xlsread('check_list.csv');
% convert it to a string
str_chk=string(l);
str_chk=lower(str_chk);

% Compare line by line for a match
for i=1:numel(str_chk)
    for j=1:numel(str_readin)
        if str_chk(i) == str_readin(j)
            disp('Match')
            % How do I add the result to a new vector?
            % How do I identify the lines that match?
        else
            disp('No')
            % add displayed word to the same vector
        end
    end
end

Answer 1

dpb on 29 Mar 2021

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/786949-compare-two-list-of-strings-line-by-line-for-a-match-and-summarize-results#answer_661854

Edited: dpb on 1 Apr 2021

You don't need to loop explicitly -- MATLAB has built-in functions to do that for you.

readin=lower(string(textread('readin.csv','%s','delimiter','\n')));
check=lower(string(textread('check_list.csv','%s','delimiter','\n')));
[ia,locb]=ismember(readin,check);

gives you

>> find(ia)
ans =
     8
>> readin(ia)
ans = 
    "joker"
>> check(locb~=0)
ans = 
    "joker"
>> 

So, depending upon what you need to return, you have two pieces of information. ia is a logical array marking which elements of the readin string array were found in the check string array, and locb gives, for each of those elements, the location of the first match in the check array (if there is more than one).

See the doc for ismember for the full details on input/output arguments.
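To tie this back to the original question about a summary vector, here is a minimal sketch of how the ismember outputs could be turned into one "Match"/"No" entry per line of readin. The sample strings and the table variable names are made up for illustration; the real lists would come from readin.csv and check_list.csv.

```matlab
% Hypothetical sample data standing in for readin.csv and check_list.csv
readin = lower(["Ace"; "Joker"; "King"]);
check  = lower(["queen"; "joker"; "jack"; "Joker"]);

% ia(k) is true when readin(k) occurs anywhere in check;
% locb(k) is the index of its first occurrence in check (0 if none)
[ia, locb] = ismember(readin, check);

% One summary entry per line of readin, replacing the disp calls
summary = repmat("No", numel(readin), 1);
summary(ia) = "Match";

% Collect everything in one table for review or export
% (the variable names here are illustrative only)
results = table((1:numel(readin)).', readin, summary, locb, ...
    'VariableNames', {'Line', 'Readin', 'Result', 'FirstLocInCheck'});
disp(results)
```

If the table is what you want in Excel, something like writetable(results,'summary.csv') (file name assumed) could replace the copy/paste from the command window.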

ADDENDUM:

>> readin=[readin;"2"];         % add another element that is duplicated in check
[ia,locb]=ismember(readin,check);
found=readin(ia);
% illustrate what we get...
>> found
found = 
  2×1 string array
    "joker"
    "2"
>> 
nFound=numel(found);
locsInCheck=cell(nFound,1);           % preallocate cell array for locations
for i=1:nFound
  locsInCheck(i)={find(found(i)==check)};
end
% show what we gots...
>> locsInCheck
locsInCheck =
  1×2 cell array
    {[8.00]}    {6×1 double}
>> 

You've now got the strings that were found in the check array, along with every location in check where each of them occurs.

Only the second loop is needed, to find all the locations for each matched string; the first loop is handled inside the much more efficient built-in ismember function, which does the hard work.

If you wanted, you could replace the explicit for...end loop above with an arrayfun construct (cellfun needs a cell array as input, and found is a string array):

locsInCheck=arrayfun(@(s) find(s==check),found,'uniform',0);
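For anyone who wants to try this without the uploaded files, the following is a self-contained sketch of the same idea on made-up data. It uses arrayfun, which iterates over the elements of a string array directly, and also counts the occurrences of each matched string. The data and the names firstLoc and nOccur are assumptions for illustration.

```matlab
% Hypothetical data: "2" occurs three times in check, "joker" once
readin = ["ace"; "joker"; "2"];
check  = ["2"; "queen"; "2"; "joker"; "2"];

[ia, locb] = ismember(readin, check);   % locb holds only the first match
found    = readin(ia);                  % strings of readin present in check
firstLoc = locb(ia);                    % first location of each in check

% All locations in check for each found string, one cell per string
locsInCheck = arrayfun(@(s) find(s == check), found, 'UniformOutput', false);

% Number of occurrences in check for each found string
nOccur = cellfun(@numel, locsInCheck);

% Print a one-line summary per found string
for k = 1:numel(found)
    fprintf('%s: first at %d, %d occurrence(s) at %s\n', ...
        found(k), firstLoc(k), nOccur(k), mat2str(locsInCheck{k}.'));
end
```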

9 Comments
dpb on 1 Apr 2021

Edited: dpb on 2 Apr 2021

"if 'joker' is listed in readin(2) and in more than one locations in the check file example check(8) and check(16), the provided code will fail to count both matching string. "

'joker' is NOT duplicated in check, or at least in the uploaded version of it.

You misunderstand the output of ismember. Its first output is a logical array that is true for each element of the first ("A") argument that exists in the second ("B") argument; ergo, it is the same size as A. Granted, locb only returns the location of the first match in B.

However, if your readin array is unique, then you can get what you want by simply reversing the order of the two arguments, finding each entry of the check array that is contained in the readin array. The logical array ia will then be true for every element of check that is matched, so repeated entries will all show up in ia. locb then gives the location of each matched string in readin, which points to the same readin entry multiple times in the one-to-many relation.

You've not defined whether readin is or is not unique. The first solution works for either case, the latter only if one or the other array is unique.

Either way, you only need a result array that is the size of one or the other list, not the product of the two sizes.

Here's the t'other way 'round solution, using the above case where the extra "2" was added to the readin array so that it duplicates something in check.

First, the entries at check(8) and check(16)

>> check([8 16])
ans = 
  2×1 string array
    "joker"
    "silver"
>> 

aren't duplicates.

>> [ia,locb]=ismember(check,readin);  % look at elements of check in readin instead
>> check(ia)                          % what it finds -- all the cases of each
ans = 
  7×1 string array
    "2"
    "2"
    "2"
    "2"
    "2"
    "2"
    "joker"
>> readin(locb(locb>0))   % the _readin_ strings to match each check value found
ans = 
  7×1 string array
    "2"
    "2"
    "2"
    "2"
    "2"
    "2"
    "joker"
>> locb(locb>0)    % the actual locb values -- note "2" is same string repeated
ans =
         23
         23
         23
         23
         23
         23
          8
>>     

Either way solves your problem, depending upon the characteristics of the data. If one array has no duplicates (or you can use unique to examine just the unique values), then you can solve it in one line of code with no explicit loops. If you must keep the duplicates and both arrays can have them, then you need two lines of code and a loop (which can be implicit, with cellfun or arrayfun).
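As a rough sketch of the unique route mentioned above: collapse one list to its unique values, run ismember against that, then map the result back to the original lines. The data and the names nInCheck and countPerLine are invented for illustration, and counting matches is just one possible summary.

```matlab
% Hypothetical data with duplicates in both lists
readin = ["joker"; "2"; "ace"; "2"];
check  = ["2"; "queen"; "2"; "joker"];

% Collapse readin to its unique values; ic maps each original line
% back to its entry in u, so u(ic) reproduces readin
[u, ~, ic] = unique(readin);

% With a unique second argument, every matching element of check is
% flagged in ia, and locb points at the matching entry of u
[ia, locb] = ismember(check, u);

% Number of lines in check that match each unique readin value
nInCheck = accumarray(locb(ia), 1, [numel(u) 1]);

% Expand the counts back to one value per original readin line
countPerLine = nInCheck(ic);
disp(table(readin, countPerLine))
```

Here check(ia) lists every matching line of check, duplicates included, without any explicit loop.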

dpb on 7 Apr 2021

Edited: dpb on 7 Apr 2021

No, that isn't actually the goal; the "line by line" came from the OP's initial implementation, which used doubly-nested for...end loops to compare each line of the first array against all lines of the second. With ismember, that loop instead runs internally in compiled code, which locates the elements of the first array that have matches in the second much more efficiently.

However, the goal was to find all occurrences of a line from one array in the other, where there can be duplicates, and also to return the duplicated elements. It was never made clear whether both arrays may contain duplicates or whether one of them is unique. In the latter case one can get by with no loops. In the former, one still has to loop over the found elements of one array to return the duplicates in the other, since ismember only returns the first match of possibly many in its second output. I've lobbied for years for an alternate optional cell output that would return all matches directly, but so far no joy on that front.

Adam Danz on 7 Apr 2021

I can see how the additional ismember output would be helpful.

This doesn't really avoid loops, but it works with string arrays that aren't too long:

A = ["A" "B" "C" "B"];
B = ["B" "B" "C" "B" "A"];
[row, col] = find(A(:)==B(:).');
arrayfun(@(i){B(col(row==i))}, 1:numel(A))
ans = 1×4 cell array
    {["A"]}    {["B"    "B"    "B"]}    {["C"]}    {["B"    "B"    "B"]}
% Or to return Bi, 
% arrayfun(@(i){col(row==i)}, 1:numel(A))
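The same pairwise comparison can also recover the "line number" the question refers to, i.e. the position of a match in the output of the original double loop (check in the outer loop, readin in the inner loop). This is a sketch on made-up data; the names lineNo, iChk and jRead are assumptions. As above, it builds a numel(readin)-by-numel(check) logical matrix, so it only suits lists that aren't too long.

```matlab
% Hypothetical data; the real lists come from readin.csv and check_list.csv
readin = ["ace"; "joker"; "king"];
check  = ["queen"; "joker"; "jack"; "joker"];

% M(j,i) is true when readin(j) equals check(i)
M = readin(:) == check(:).';

% Column-major linear indices of the matches. With readin along the rows,
% index (i-1)*numel(readin) + j equals the iteration count of a double
% loop that has check(i) outside and readin(j) inside
lineNo = find(M);

% Recover which pair of entries each line number refers to
[jRead, iChk] = ind2sub(size(M), lineNo);
matches = table(lineNo, iChk, jRead, check(iChk), readin(jRead), ...
    'VariableNames', {'Line', 'CheckIdx', 'ReadinIdx', 'CheckStr', 'ReadinStr'});
disp(matches)
```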
