Problem with text analysis

Question

David MERCIER on 19 Oct 2021

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/1567363-problme-with-text-analysis

Answered: DGM on 19 Oct 2021

Hi, I am trying to clean a table containing both Latin and non-Latin strings so that I can plot a word cloud. I used the regexprep function, but without success: I can't remove the Korean strings. Any ideas? Here is an example of the code and its output:

pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all character vectors to lowercase
T.Keyword = lower(T.Keyword);
% Remove keywords that are not useful
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Set lower case
T.Keyword = lower(T.Keyword);
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
 
8×2 table
                 Keyword                 Sessions
    _________________________________    ________
    'stuff'                                390   
    'forum'                                128   
    'student'                               76   
    '재료'                                  59   
    'stuff'                                 56   
    'uninstall_stuff_license_manager'       52   
    'stuff_resource_center'                 43   
    'stuff_student_community'               34   

Answer 1

DGM on 19 Oct 2021

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/1567363-problme-with-text-analysis#answer_812213

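I'm terrible with regex, but this might get you somewhere. The pattern needs square brackets to form a character class: '[^a-z_]' matches every character that is not a lowercase letter or an underscore, so regexprep removes all of those.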

A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
A = 9×1 cell array
    {'9.banana'        }
    {'orange-123_juice'}
    {'ン戦国時'         }
    {'apple_sauce'     }
    {'abcクルミ'        }
    {'peach'           }
    {'pear'            }
    {'ピラミッド'       }
    {'cherry'          }
B = regexprep(A,'[^a-z_]','')
B = 9×1 cell array
    {'banana'      }
    {'orange_juice'}
    {0×0 char      }
    {'apple_sauce' }
    {'abc'         }
    {'peach'       }
    {'pear'        }
    {0×0 char      }
    {'cherry'      }
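
To connect this back to the table in the question, here is a minimal sketch. The small table is a hypothetical stand-in for the spreadsheet, and the variable name unchanged is only for illustration. It shows why the unbracketed pattern '^a-z' had no effect (outside brackets, ^ anchors the match to the start of the string), and then removes the rows that end up empty after cleaning.

```matlab
% Hypothetical stand-in for the spreadsheet data in the question.
Keyword  = {'stuff'; 'forum'; '재료'; 'stuff_resource_center'; 'abc재료'};
Sessions = [390; 128; 59; 43; 12];
T = table(Keyword, Sessions);

% Without brackets, ^ is a start-of-string anchor, so '^a-z' only matches
% the literal text 'a-z' at the start of a string. Nothing here matches.
unchanged = regexprep(T.Keyword, '^a-z', '');
disp(isequal(unchanged, T.Keyword))   % the keywords are untouched

% Inside brackets, ^ negates the set: remove every character that is not
% a lowercase letter or an underscore.
T.Keyword = regexprep(T.Keyword, '[^a-z_]', '');

% Rows that contained only non-Latin text are now empty; drop them.
T(cellfun(@isempty, T.Keyword), :) = [];

disp(T)
```

Entries that mix scripts, such as 'abc재료', keep their Latin part ('abc') rather than being dropped, which may or may not be what you want. The cleaned table could then probably be passed straight to the word cloud, for example with wordcloud(T,'Keyword','Sessions').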
