Improve the performance of a function based on str2double

Hi all, I have a function that takes a line of text read from a TXT file. The lines contain information of this type:
LINE1: N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT
LINE2: N3 A3 X1.45 ;TEXT
...
After the ;TEXT there could be more information of the same type that would not have to be taken into account, for example:
LINE3: N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44
I give several example lines to show that not all lines contain the same information.
What I want is a matrix (for example A) holding the values that appear after X, Y or Z, with NaN where a line does not contain that information. For the example, A should be:
A = [5.45 4.45 -10.25
1.45 NaN NaN
-5.5 9.35 -1.5];
The function I am using is coordinatesCHAR, shown below. tline is the line of text in question and matchWords is a cell array, which for this case would be: matchWords = {'X','Y','Z'};
Even when the number of lines is low the processing time is relatively high, and the text files I am working with have several thousand lines, so it is not practical.
I was able to verify that the slowest functions were str2double and regexp. Does anyone know how I can improve this?
function XYZ = coordinatesCHAR(tline,matchWords)
% Regular expression to locate the start (a) and end (b) of each number.
[a,b] = regexp(tline,'[+-]?\d+(\.\d+)?');
XYZ = NaN(1,length(matchWords));
for ii = 1:length(matchWords)
    isfind = strfind(tline,matchWords{ii});
    if ~isempty(isfind) && ~isempty(a) && ~isempty(b)
        % If isfind has more than one component take the first position
        strPos = find(a == isfind(1)+1);
        if isempty(strPos)
            XYZ(1,ii) = NaN;
        else
            XYZ(1,ii) = str2double(tline(a(strPos):b(strPos))); % Get the value up to the next character
        end
    end
end
I searched in different forums and tried using the "str2doubleq" function, but the improvement was minimal.
Thank you so much for all.

3 Comments

An example of using this function would be:
XYZ = coordinatesCHAR('N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT',{'X','Y','Z'})
dpb on 30 Jan 2021
Edited: dpb on 30 Jan 2021
Is the trailing semicolon always present?
And is the leading "LINE N:" present in the text or just something you've shown?
@dpb I will answer you as follows:
  • Is the trailing semicolon always present?. No, it is sometimes present
  • And is the leading "LINE N:" present in the text or just something you've shown? It is something I have simply shown but it does not appear

 Accepted Answer

dpb
dpb on 30 Jan 2021
Edited: dpb on 30 Jan 2021
"Deadahead" solution without any attempt to use anything fancy...regular expressions are known to be expensive; I've never compared/timed relative to the new string functions to know where they stack up...
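function ret=coordinatesCHAR(tline,vars)
% for input line beginning with text, may have trailing comments after semicolon
  ret=nan(1,numel(vars));                                      % NaN marks a variable missing from the line
  if contains(tline,';'), tline=extractBefore(tline,';'); end  % drop the trailing comment
  t=split(tline);                                              % whitespace-delimited tokens
  t=t(contains(t,vars));                                       % keep tokens containing a requested letter
  v=cellfun(@(s)sscanf(s(2:end),'%f'),t);                      % numeric part after the leading letter
  ix=contains(vars,cellfun(@(s)s(1),t,'uni',0));               % which requested variables are present
  ret(ix)=v;
end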
For the sample
>> A=[];
>> for i=1:numel(txt),A=[A;coordinatesCHAR(txt(i),vars)];end
>> A
A =
5.4500 4.4500 -10.2500
1.4500 NaN NaN
-5.5000 9.3500 -1.5000
>>
Revised above tested with
>> txt
txt =
3×1 cell array
{'LINE1: N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT' }
{'LINE2: N3 A3 X1.45 ' }
{'LINE3: N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44'}
>>
w/o the trailing semicolon. The leading "LINE" is immaterial, actually; just has a little longer string this way but the logic still works.
>> txt
txt =
3×1 cell array
{'N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT' }
{'N3 A3 X1.45' }
{'N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44'}
>> A=[];
>> for i=1:numel(txt),A=[A;coordinatesCHAR(txt(i),vars)];end
>> A
A =
5.4500 4.4500 -10.2500
1.4500 NaN NaN
-5.5000 9.3500 -1.5000
>>

10 Comments

The loop could be brought into the function so it handles the array internally; that would probably save a little overhead as well.
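As a rough illustration of that idea, here is a minimal sketch of a batch version. The name coordinatesBatch is made up for this example and was not posted in the thread. It assumes lines is a cell array of char vectors and vars holds single-letter tags, and it uses only strfind and sscanf rather than the string functions above.

```matlab
function A = coordinatesBatch(lines, vars)
% Hypothetical batch wrapper (not from the thread): parse every line in one call.
% lines : cell array of char vectors, one text line per cell
% vars  : cell array of single-letter tags, e.g. {'X','Y','Z'}
A = nan(numel(lines), numel(vars));          % preallocate; NaN marks a missing tag
for k = 1:numel(lines)
    tline = lines{k};
    semi = strfind(tline, ';');
    if ~isempty(semi)
        tline = tline(1:semi(1)-1);          % drop the trailing comment
    end
    for i = 1:numel(vars)
        i1 = strfind(tline, vars{i}) + 1;    % position just after each occurrence of the tag
        if isempty(i1), continue, end
        val = sscanf(tline(i1(1):end), '%f', 1);  % read one number after the first occurrence
        if ~isempty(val), A(k,i) = val; end
    end
end
end
```

It would be called once per file, for example A = coordinatesBatch(txt, {'X','Y','Z'}), so the function-call overhead is paid once rather than once per line. Whether that is measurably faster on a given file would need to be timed.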
Thank you very much for the solution, the only problem I see is that even though I didn't put it in the examples, the line could be:
'LINE4: N1 A2 Y9.35 Z-1.5'
And in this case, the result should be:
A =
NaN 9.35 -1.5
Instead of
A =
9.35 -1.5 NaN
By this I mean that the first column should hold the values of the first variable (X in this case), the second those of Y, and the third those of Z. I will try to modify the original code.
Hmmm....I surely thought that would have worked...I'll look into why it doesn't here in a little bit.
Of course! Don't worry! I will try to solve it myself though. Thank you very much for everything!
The order of the arguments to CONTAINS() was backwards; it should be
ix=contains(vars,cellfun(@(s)s(1),t,'uni',0));
Revised gives:
>> txt
txt =
4×1 cell array
{'N1 A2 X5.45 Y4.45 Z-10.25 ;TEXT' }
{'N3 A3 X1.45' }
{'N1 A2 X-5.5 Y9.35 Z-1.5 ;X25 Y-4.44'}
{'N1 A2 Y9.35 Z-1.5' }
>> A=[];
>> for i=1:numel(txt),A=[A;coordinatesCHAR(txt(i),vars)];end
>> A
A =
5.4500 4.4500 -10.2500
1.4500 NaN NaN
-5.5000 9.3500 -1.5000
NaN 9.3500 -1.5000
>>
Be interested to hear how it performs comparatively for speed.
I did update the original...
Wow, amazing, I was still thinking how to try to solve it and I see your answer... thank you so much!
I'll let you know how it goes with the speed issue.
In the case of using the function you just implemented:
And with mine:
Mine is faster... :(
But never mind, don't worry, I thank you very much for all your help.
I knew the cellfun probably would be something; I'm surprised the one w/ sscanf is the top dog there, though.
One can look at how to break those down some; as noted it was definitely all at the highest level first.

Sign in to comment.

More Answers (1)

dpb on 31 Jan 2021
Edited: dpb on 31 Jan 2021
Variations upon a theme -- this is almost 2X as fast as my previous one...here it times out as just a fraction ahead of the original; I'm not sure one can beat that by much without mex after this experiment; at least nothing comes to me that would be markedly faster.
The high-level overhead of cellfun and the user-friendly string data type functions is all taken out of the following; str2double calls sscanf to do the work, so using it is going backwards (but by surprisingly little) by adding the calling overhead.
regexp pulling tokens turns out to be essentially as fast as using the builtin strfind on each variable sequentially; that did surprise me somewhat. I wasn't surprised that the first try with the user-friendly stuff wasn't a performance demon, but I expected that getting rid of regexp would show more benefit.
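function ret=coordinatesCHAR4(tline,vars)
  ret=nan(1,numel(vars));
  if contains(tline,';'), tline=extractBefore(tline,';'); end
  tline=char(tline);
  for i=1:numel(vars)
    i1=strfind(tline,vars{i})+1;            % first character after the variable letter; {} works for cell or string vars
    if isempty(i1), continue, end           % variable not on this line; leave NaN
    i2=i1+strfind(tline(i1+1:end),' ')-1;   % last character before the next blank
    if isempty(i2), i2=length(tline); end   % no blank follows; value runs to end of line
    ret(i)=sscanf(tline(i1:i2),'%f');
  end
end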
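The regexp token variant mentioned above was not posted. A minimal sketch of what it could look like follows. The name coordinatesTokens and the pattern are assumptions made for illustration, not the code that was actually timed. It assumes every entry of vars is a single letter.

```matlab
function ret = coordinatesTokens(tline, vars)
% Hypothetical regexp-token variant (name and pattern are illustrative only).
ret = nan(1, numel(vars));
semi = strfind(tline, ';');
if ~isempty(semi), tline = tline(1:semi(1)-1); end   % ignore the trailing comment
% One pass: capture each tag letter and the signed number that follows it.
pat = ['([' vars{:} '])([+-]?\d+\.?\d*)'];
tok = regexp(tline, pat, 'tokens');
for k = numel(tok):-1:1                   % reverse order so the first occurrence wins
    i = find(strcmp(vars, tok{k}{1}), 1); % column belonging to this tag
    ret(i) = sscanf(tok{k}{2}, '%f');
end
end
```

The single regexp call replaces one strfind per variable, which is the trade-off described above.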
One can make just a couple of refinements to the original --
function XYZ = coordinatesCHAR(tline,matchWords)
% Regular expression to locate the start (a) and end (b) of each number.
XYZ=nan(1,length(matchWords));
[a,b] = regexp(tline,'[+-]?\d+(\.\d+)?');
if isempty(a), return, end % no tokens found; return
for ii = 1:length(matchWords)
    isfind = strfind(tline,matchWords{ii});
    if isempty(isfind), continue, end
    % If isfind has more than one component take the first position
    strPos = find(a==isfind(1)+1);
    if isempty(strPos), continue, end
    XYZ(1,ii)=sscanf(tline(a(strPos):b(strPos)),'%f'); % Get the value up to the next character
end
end
The above just rearranges the logical tests a little and eliminates the duplicate storing of a NaN for a missing variable that was in the else clause, since the array has already been initialized.
>> tic;for n=1:10000;for i=1:numel(txt),A=coordinatesCHAR0(txt{i},vars);end;end;toc
Elapsed time is 1.976065 seconds.
>> tic;for n=1:10000;for i=1:numel(txt),A=coordinatesCHAR4(txt{i},vars);end;end;toc
Elapsed time is 1.797489 seconds.
>>
"0" is the modified original above, "4" is my last submission above...

7 Comments

Wow, first of all I want to thank you for all the help you have offered me, I have indeed tested your function (4) on the file I have and it is indeed much faster than the original.
Thank you very much for all the help you have given me.
You're welcome...glad to try to help.
I'm somewhat surprised to hear "much faster" -- some I would expect but only like 10-20% based on the limited timings I did here.
What kind of difference did you actually see, just out of curiosity?
Exactly! it is a difference of approximately 15%, but realize that in one line of text you are not able to appreciate it, however when there are 300000 lines and the time is around 50 seconds you begin to appreciate that 15% less is much faster.
OK. That seems pretty slow; the machine here is certainly nothing out of the ordinary, a 10-year-old mid-to-lower range system.
The timing above was 10,000 * 4 lines --> 40,000 lines @ 2 sec; that would extrapolate to only about 15 sec., not 50.
What else is going on there; it isn't by any chance dynamically allocating the output array or something similar is it?
If I change the timing loop slightly
>> A=[];tic;for n=1:10000;for i=1:numel(txt),A=[A;coordinatesCHAR4(txt{i},vars)];end;end;toc
Elapsed time is 5.720677 seconds.
>> whos A
Name Size Bytes Class Attributes
A 40000x3 960000 double
>>
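the time is about 3X what it is otherwise. Well, all one has to do is put that into yet another loop and save some numbers to produce...
that shows that if you do preallocate, the timing is linear rather than growing faster than linearly (roughly quadratically, since every concatenation copies the whole array).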
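The above was
for N=[1 2 5 10 15 20]*1000
  A=zeros(N*4,3);   % preallocate the full output (txt has 4 lines)
  j=0;              % row counter, reset once per timing run, not once per pass
  tic
  for n=1:N
    for i=1:numel(txt)
      j=j+1;
      A(j,:)=coordinatesCHAR4(txt{i},vars);
    end
  end
  toc
end
for the preallocation case.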
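For contrast, the two allocation patterns can be put side by side in a small standalone sketch. The row values here are dummies chosen for illustration (no parsing is done), so only the cost of growing the array versus writing into a preallocated one is measured. Actual timings depend on the machine and release.

```matlab
% Sketch: growing an array by concatenation vs. filling a preallocated one.
% The row contents are dummy values; only the allocation pattern matters.
N = 20000;
row = [1 2 3];

tic
A = [];
for j = 1:N
    A = [A; row];      %#ok<AGROW> reallocates and copies A on every pass
end
tGrow = toc;

tic
B = zeros(N, 3);       % allocate once
for j = 1:N
    B(j,:) = row;      % write in place
end
tPre = toc;

fprintf('grown: %.3f s, preallocated: %.3f s\n', tGrow, tPre);
```

Repeating this for several values of N, as in the loop above, is what shows the different growth rates.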
No, the matrices are predefined to zeros so that this does not happen. I was testing and for the function I had initially made I would have a processing time of 5.61 seconds when the number of lines in the file is 200000 (approximately) while using your function would have a 2 second improvement in time, which is very good.
Thank you very much for everything.
No problem...glad to help -- and glad to hear the performance is better and that the preallocation step wasn't overlooked. Figured worth checking on.
i2=i1+strfind(tline(i1+1:end),' ')-1;
if isempty(i2), i2=length(tline); end
If you can assure there is at least one blank at the end of the line, the above test/fixup could be eliminated. Whether it would speed up the result much or not I don't know, didn't try it.
I debated adding a blank just to be sure but didn't try that, either...
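A minimal sketch of that padding idea, written as a standalone script with an example line taken from the question. It was not timed, and the concatenation that adds the blank copies the string, so it may or may not cost more than the test it removes.

```matlab
% Sketch: pad the line with one blank so every value is followed by a space.
tline = 'N3 A3 X1.45';            % no trailing blank in the raw text
vars  = {'X','Y','Z'};
ret   = nan(1, numel(vars));
tline = [tline ' '];              % guarantees the strfind below finds a blank
for i = 1:numel(vars)
    i1 = strfind(tline, vars{i}) + 1;
    if isempty(i1), continue, end
    i1 = i1(1);                            % first occurrence of the variable
    sp = strfind(tline(i1:end), ' ');      % never empty thanks to the padding
    ret(i) = sscanf(tline(i1:i1+sp(1)-2), '%f');
end
```

Like the function above, this assumes a number always follows the variable letter directly.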


Release: R2020b

Asked: 30 Jan 2021

Commented: dpb on 1 Feb 2021
