Alternatives to using ismember() for string versus cell array of strings

37 views (last 30 days)
Back when I was using R2009b, I got into the habit of using ~isempty(find(strcmp(thisstring,cellarrayofstrings))) in place of ismember(). In the case of testing a single string against an array of strings, this was about 50x as fast for my particular test arrays in R2009b.
Going back and revisiting some of the same optimization motivations in newer versions (particularly R2015b, though the students use newer versions), I find that it's still about 2x-4x as fast. As these sort of tests are often part of parsing input arguments to a function, these little bits of time just add to the overhead (sometimes as much as 20% of total execution time), and I'd like to pare them down where I can.
I know Jan has CStrAinBP() on the FEX, but at least for a single string versus an array, it's slower than my current approach. While the MEX variant is probably better, I can't practically introduce a dependency on something that a student is going to need priviliges or extra know-how to compile. Compiling MEX functions is just not really an option. That probably rules out ideal solutions.
I've seen some suggestions regarding pre-sorting, though I haven't observed any advantage in doing so. I've seen suggestions to use ismembc, though that requires enough ancillary preparation and output handling that it winds up being slower any way I try to use it. I've tried just making the expression slightly less ugly by doing something like any(strcmp(thisstring,cellarrayofstrings)) or sum(strcmp(thisstring,cellarrayofstrings))>0, though with some sets, those were very slightly slower in older versions.
To clarify, I'm only testing a single string against a cell array of 5-20 strings. I know I'm probably misguided to be chasing a few microseconds, but I'd be glad to know if someone knows a faster or less ugly way than my workaround.
  2 Comments
Walter Roberson
Walter Roberson on 19 Jul 2020
any(strcmp(thisstring,cellarrayofstrings))
Possibly. In the case that thisstring is character vector or cell array with one character vector or scalar string object, then the any() is not needed.
DGM
DGM on 19 Jul 2020
I guess I shouldn't be calling them strings, should I? Sorry, I'm talking about char arrays and cell arrays of chars. If I use something like strcmp({'aaa'},{'aaa','bbb','ccc','ddd'}), I still get a logical vector that I'd need to reduce to a single logical value. Am I misunderstanding something?

Sign in to comment.

Accepted Answer

Bruno Luong
Bruno Luong on 19 Jul 2020
Edited: Bruno Luong on 19 Jul 2020
In general my preference is "any(strcmp(s,cellchar))". I'm not sure if I did some benchmark to come to this conclusion, what I know is my codes have exclusively this syntax. Sometime I also use "ismember(s, c)" for clarity and maintainability.
Depending on what those chars mean in the program, but if they represents a finite set of something, better not to use char/string at all, but replace them with integer values (sort of enum). Lately MATLAB introduces Enumeration Class. I haven't use them yet, so no comment about speed.
Here is however an update of benchmark times of various methods for R2020a run on my Windows laptop (sec):
ismember(s,cellchar): 0.746642
~isempty(find(strcmp(s,cellchar),1)): 0.490355
any(strcmp(s,cellchar)): 0.282890
ismember(s,cellstring): 7.404710
You'll observe that the STRING class (last result, see code bellow for more detail) is remarkable slow compared to historical CHAR class.
This is a general common observation of mine: As most of the recent class MATLAB introduces (table, string, ...). They might look nicer in term of objects and method, but often they are built on top of leaner original class, thus slower. I also find their definitions/methods often have too much profileration, and users lost sense of what is beneath, whereas the historical MATLAB class is just right, closer "mathematically truth" of what those objects are. People who start learning MATLAB seem to get more confuse than ever. The small differences bewteen different classes/methods become overwhelming, and users will more likely lost their focus on what they suppose to do, meaning carry out the algorithm in an hight (more abstract) level.
For reference the benchmark code is bellow; it generates randomly a cell of 8 chars, each of them have length(8). The test loop on 9 cases where the string would be found in position 1-8 of the "patterns", plus the case where the string is not found.
function benchstrcmp
ntests = 100000;
cellchar1 = cellstr(char('a'-1+randi('z'-'a'+1,9,8)));
cellchar = cellchar1(2:end);
%%
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ismember(s,cellchar);
end
end
t=toc;
fprintf('ismember(s,cellchar): %f\n', t);
%%
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = ~isempty(find(strcmp(s,cellchar),1));
end
end
t=toc;
fprintf('~isempty(find(strcmp(s,cellchar),1)): %f\n', t);
%% This is the fastest
tic
for k=1:length(cellchar1)
s = cellchar1{k};
for n=1:ntests
b = any(strcmp(s,cellchar));
end
end
t=toc;
fprintf('any(strcmp(s,cellchar)): %f\n', t);
%% This is slowest by large
cellstring1 = string(cellchar1);
cellstring = string(cellchar);
tic
for k=1:length(cellchar1)
s = cellstring1(1);
for n=1:ntests
b = ismember(s,cellstring);
end
end
t=toc;
fprintf('ismember(s,cellstring): %f\n', t);
  6 Comments
DGM
DGM on 19 Jul 2020
I saw that, and the thought crossed my mind. I'll try using the version in R2015b and see if I want to pursue that. I suppose I could just look at timeit.m and learn from that.
Bruno Luong
Bruno Luong on 20 Jul 2020
Edited: Bruno Luong on 20 Jul 2020
Sorry but personally I'm not a big fan of TIMEIT for two 2 reasons:
  1. it tries to correct the how eventually the output parameters transfered to caller variables.
  2. It requires a wrap around function, which can be a strong bias when the "thing" that we investigate has a short runtime.
Why not just run the code natually as in REAL life and doing tic/toc to measure them?
Timeit seems to try to get rid of side times between the entanglement between the CPU to perform the main task and all secondary stuffs needed around for the main task, possibly trying to be independent also of OS activity/jitter by doing median time instead of mean time . This undoing seem to me unnatual and overkill to me.
After all tic/toc is the time user have to wait for MATLAB+OS doing some stuffs with some kind of processor. It cannot be more direct than that, and I'm personally interested in this time returned by tic/toc, nothing else matter more than that, period.
I don't want to use timeit in-place of tic/toc.

Sign in to comment.

More Answers (0)

Categories

Find more on Characters and Strings in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!