Fastest way to replace multiple substrings with a single new string?

Hello Everyone,
I'm trying to replace 7k different substrings with the same tag in a 50-million-word dataset (a cell array of 1 million strings, averaging 50 words each). Using replace or regexprep takes a long time. I tried using strrep the same way as replace, but it gives me this error:
Error using strrep
All nonscalar inputs must be the same size.
What is the fastest and least memory-consuming way to do this?
Here is the code:
% using replace
Tag='IMPORTANT';
substr={'very','much'}; % a cell array of 7k+ words
reptag=cell(1,size(substr,2));
tagcell=cellfun(@(x) Tag,reptag,'UniformOutput',false); % one copy of Tag per substring
maintext=replace(maintext,substr,tagcell);
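% using regexprep
% build one alternation pattern of the form (very|much|...)
ev='(';
for evi=1:size(substr,2)
ev=[ev substr{evi} '|']; % index into the cell to append the word itself
end
ev=[ev(1:end-1) ')']; % drop the trailing '|' and close the group
maintext=regexprep(maintext,ev,Tag);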
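For reference, here is a minimal sketch of the three calls on a toy input. It assumes maintext is a cell array of character vectors; the two sample sentences are made up for illustration. strrep raises the error quoted above because its nonscalar inputs (maintext and substr) differ in size, so it needs one call per substring. replace, by contrast, accepts a single replacement for a whole list of substrings, so tagcell is not strictly needed. Escaping with regexptranslate only matters if the substrings may contain regex metacharacters.

```matlab
% Toy data (hypothetical; stands in for the real 1M-sentence cell array)
maintext = {'thank you very much', 'not so much'};
substr   = {'very', 'much'};
Tag      = 'IMPORTANT';

% replace: one scalar replacement applies to every entry of substr
out1 = replace(maintext, substr, Tag);

% strrep: only one old/new pair per call, so loop over the substrings
out2 = maintext;
for k = 1:numel(substr)
    out2 = strrep(out2, substr{k}, Tag);
end

% regexprep: escape metacharacters in each word, then join the words
% into a single alternation pattern such as (very|much)
escaped = cellfun(@(s) regexptranslate('escape', s), substr, ...
                  'UniformOutput', false);
pat  = ['(' strjoin(escaped, '|') ')'];
out3 = regexprep(maintext, pat, Tag);
```

All three variants match substrings, not whole words, so 'much' inside a longer word would also be replaced. This sketch says nothing about relative speed on the full dataset; that would need to be timed.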
  4 Comments
Omar Salah on 10 Jun 2020
@james I can actually work with both: either a cell array of character vectors or a cell array of strings. I move between them easily. Is one type faster than the other?
Omar Salah on 10 Jun 2020
@stephen I have never worked with C++, but I'm wondering: why would they be faster? Is it because they are compiled, or because C++ functions are generally faster?


Answers (1)

Mohammad Sami on 11 Jun 2020
After some experimentation, I think that if you tokenize your sentences, you can use a hash map to look up the words to replace.
Example code follows. If you want case-insensitive matching, apply the function lower to both the words and the sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); % create a hashmap of the substrings you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
    m3 = split(m2{i},' '); % tokenize the sentence
    m4 = w.isKey(m3); % look up which words to replace
    m3(m4) = {'IMPORTANT'}; % replace the words
    m5(i) = join(m3,' '); % store the updated sentence
end
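The case-insensitive variant mentioned above could look roughly like the sketch below. The sample sentences and the variable names (keys, tokens, hit, out) are made up for illustration. One assumption differs from a literal reading of the advice: only the lookup copy of each token is lower-cased, so words that are not replaced keep their original casing.

```matlab
% Toy data (hypothetical)
sentences = {'Thank you VERY much', 'not so Much'};
substr    = {'very', 'much'};
Tag       = 'IMPORTANT';

% Key the map on lower-cased, de-duplicated words; the values are unused
keys = unique(lower(cellstr(substr)));
w = containers.Map(keys, keys);

sentences = cellstr(sentences);
out = cell(numel(sentences), 1);
for i = 1:numel(sentences)
    tokens = split(sentences{i}, ' ');   % tokens in their original case
    hit = isKey(w, lower(tokens));       % look up lower-cased copies
    tokens(hit) = {Tag};                 % replace the matching words
    out(i) = join(tokens, ' ');          % store the updated sentence
end
```

Note that this approach matches whole space-delimited tokens only. Unlike replace or regexprep, it will not touch a word embedded in a longer word, and a word with punctuation attached (for example 'much,') will not match unless the punctuation is stripped first.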
  1 Comment
Omar Salah on 18 Jun 2020
Wow! Thanks. That's definitely something to try. I will try it tonight and get back to you :)
