Match names from two different columns - Comparing strings of different lengths

Question

Maria on 6 Sep 2014

0
Link

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths

Commented: Maria on 6 Sep 2014

I have a cell variable A with about 4000 rows and 5 columns:

     C1    c2                   c3        c4          c5
  A={1997 'Michelle Applebaum' 'Salmon'  'BASIC'     'STEEL'
    1997  'Jambardella Arnold' 'Butter'  'BASIC'     'STEEL'
    1999  'Cai von Rumohr'     'Cow'     'CAPITAL'   'AEROPLANE'
    2011  'Pierre Smith'       'Milk'    'GOOD'      'AEROPLANE'
    2004  'Jinder Kauffman'    'Star'    'CAPITAL'   'PHONE'

And I have a second cell variable B with about 200 000 rows and 8 columns:

     c1         c2      c3       c4     c5       c6                     c7      c8
  B={2013  29  2225  'ELD1'  29  'SMITH         P'  4817  'HAYWOOD'
     2013  70  2628  'CCRN'  70  'FRANCE        J'  11688  'CANTORFZ'
     2013  02  952  'ABFS'  02  'KAUFFMAN      J'  356  'BUCK'
     2013  20297  157  'DUK'  20297  'ARNOLD        J'  1382  'LAWRENCE'
     2013  78362  260  'APA'  78362  'ARIF          A'  2213  'STIFEL'}

The focus is on c2 of A and on c6 of B.

C2 of A gives the complete first name and last name (and sometimes other names in between) of an individual.
C6 of B gives the last name (in capital letters) and only the initial of the first name of an individual.

I am trying to match both cells . So in case both last names are the same (or silmilar) I would like to add the columns of B to A.

       C1    c2                   c3       c4      c5
    A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
       1997 'Jambardella Arnold ' 'Butter' 'BASIC' 'STEEL' 2013 20297 157 'DUK' 20297 'ARNOLD  J' 1382 'LAWRENCE'  
       1999 'Cai von Rumohr'     'Cow'    'CAPITAL' 'AEROPLANE'
      2011  'Pierre Smith'       'Milk'   'GOOD'    'AEROPLANE' 2013 29  2225 'ELD1' 29 'SMITH P'  4817  'HAYWOOD'
      2004  'Jinder Kauffman'     'Star'   'CAPITAL' 'PHONE'     2013 02  952  'ABFS' 02'KAUFFMAN J' 356  'BUCK'

I never matched cells comparing names, and the function has to be case insensitive, ignore points, commas or spaces, and also ignore if the same letter appears twice in a row. I say this because variable A was wrote by me by hand, so it's possible the last name is not exactly equal in both variables. `strrep & strtrim` functions can help solving the problem.

Can someone help me please? Thank you.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Guillaume on 6 Sep 2014

1
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150945

Open in MATLAB Online

You've not specified what happened to compound surnames (like 'von Ruhmor' in your example) or first names (like 'Jean-Pierre'). it also appears that the fields in B are fixed width so what happens to very long surname? So, I've assumed that only the last part of a compound surname is in B, only the initial of compound first name is in B and for very long surname B is larger. I'm also ignoring case where names don't match exactly:

%extract names from A and transform in uppercase 'SURNAME 1STLETTEROFNAME':
namesfromA = upper(regexprep(A(:,2), '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
%x=extract names from B and remove extra spaces:
namesfromB = regexprep(B(:, 6), '([A-Z]+) +([A-Z])', '$1 $2');
%find the intersection and postion of matches. Assume there is only ever one match
[~, ia, ib] = intersect(namesfromA, namesfromB);
%add matches to A:
A(ia, 6:13) = B(ib, :);

If surnames don't match exactly, then it gets a lot more complicated. There are a numbers of algorithm ( Damerau-Levenshtein, Hamming) that allows you to find how two strings are similar, but I don't think any of them are built-in in matlab.

1 Comment
Show -1 older commentsHide -1 older comments

Maria on 6 Sep 2014

Open in MATLAB Online

Thank you very much for all the information. I know that getting all the matches will be a problem. I was first thinking of doing something like this:

A(:, end+1) = lower(strrep(strtrim(A(:,2)),' ',''));
B(:,end+1)= lower(strrep(strtrim(B(:,6)),' ',''));

But then I cannot use strcmpi or strncmpi because I am working with different sizes & content even if there is a match. So I was trying to build something that for instance would do a match if they find 5 letters in a row that are equal.

But anyway, your code works and is really helpful! I will read more about the numbers of algorithm. Thank you again.

Sign in to comment.

Answer 2

Stephen23 on 6 Sep 2014

2
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150944

You are trying to match words in two cell arrays, including allowing for punctuation characters and possibly slightly different spellings... not an easy task to perform! You will also need to consider initials and name order.

It really depends on how different the strings might be: if the differences only include repeated characters, then you might be able to get away with creating some regexp pattern to help with that.

If the differences are more complicated, then you need to find a similarity measure. Some common similarity measures are the Rabin-Karp algorithm, the Levenshtein distance, the Needleman–Wunsch algorithm, or the Hamming distance.

You will also find submissions on MATLAB File Exchange that support several of these measures for analyzing string similarity, and plenty of examples online.

1 Comment
Show -1 older commentsHide -1 older comments

Maria on 6 Sep 2014

Thank you very much for all the information. I will do some research about the topic of similarity measure. Thank you.

Sign in to comment.

Match names from two different columns - Comparing strings of different lengths

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment
Show -1 older commentsHide -1 older comments

More Answers (1)

1 Comment
Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Products

Community Treasure Hunt

Match names from two different columns - Comparing strings of different lengths

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment Show -1 older commentsHide -1 older comments

More Answers (1)

1 Comment Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Products

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

1 Comment
Show -1 older commentsHide -1 older comments

1 Comment
Show -1 older commentsHide -1 older comments