Need to speed up a regexprep implementation

Hi All,
I use the following MATLAB code to parse a large text file with *'s used as repeat symbols (courtesy of previous MATLAB advice).
% Expand repeats
fun = @(n,c)repmat(sprintf(' %s ',c),1,str2double(n));
n=find(contains(file,'*'));
for m=n; file(m) = regexprep(file(m),'\s*(\d+)\*(\S+)','${fun($1,$2)}'); end
clear n m fun;
An example (small) file is as follows. The single file for VAR has 1,555,181 lines and will expand to 37,720,320 values. The full file (VAR + other variables) is a 2145956x1 string array which saves to a 100MB .mat file, so a bit too large to post. The regexprep takes about 10 minutes, and is the slowest part of the file read.
Can anyone (more experienced user!) suggest a faster method of parsing the data records?
My thanks!
Mike King
p.s. As requested, a small sample file (ZCORN.zip, about 10% of a real example) is now attached.
VAR
3294.03 2*3293.74 2*3293.45 2*3293.15 3292.93 3371.97 2*3376.36 2*3380.67 2*3384.95 2*3389.14 2*3393.22 2*3397.16 2*3400.97
2*3404.64 2*3408.21 2*3411.7 2*3415.13 2*3418.49 2*3421.92 2*3425.18 2*3428.28 2*3431.28 2*3434.12 2*3436.84 2*3439.46 3441.96
3441.96 2*3444.37 2*3446.73 2*3448.97 2*3451.2 2*3453.35 2*3455.48 2*3457.6 2*3459.71 2*3461.81 2*3463.92 2*3466.09 3468.29
3468.29 2*3470.52 2*3472.85 2*3475.24 2*3477.72 2*3480.28 2*3482.92 2*3485.67 2*3488.52 2*3491.45 2*3494.45 2*3497.53 3500.67
3500.67 2*3503.84 2*3507.07 2*3510.32 2*3513.63 2*3516.96 2*3520.36 2*3523.76 2*3527.22 2*3530.75 2*3534.3 2*3537.84 3541.26
3541.26 2*3544.57 2*3547.68 2*3550.42 2*3552.83 2*3554.56 2*3555.71 2*3556.22 2*3556.05 2*3555.35 2*3553.85 2*3551.87 3549.46
3549.46 2*3546.73 2*3543.75 2*3540.54 2*3537.31 2*3534.16 2*3531.15 2*3528.34 2*3525.79 2*3523.43 2*3521.4 2*3519.76 3518.43
3518.43 2*3517.4 2*3516.81 2*3516.51 2*3516.62 2*3517.12 2*3517.97 2*3519.17 2*3520.69 2*3522.5 2*3524.51 2*3526.68 3528.91
3528.91 2*3531.24 2*3533.62 2*3536.07 2*3538.58 2*3541.21 2*3543.93 2*3546.77 2*3549.71 2*3552.75 2*3555.9 2*3559.13 3562.45
8153.47 2*8155.84 2*8158.5 2*8161.45 2*8164.68 2*8168.03 2*8171.71 2*8175.57 2*8179.59 2*8183.75 2*8188.04 2*8192.41 8196.84
8196.84 2*8201.31 2*8205.94 2*8210.66 2*8215.46 2*8220.35 2*8225.3 2*8230.37 2*8235.46 2*8240.67 2*8245.88 2*8251.14 8256.38
8256.38 2*8261.66 8267.28 8281.77 2*8282.08 2*8282.04 2*8282.01 2*8281.99 2*8281.96 2*8281.92 2*8281.85 2*8281.79 2*8281.77
2*8281.8 2*8281.95 2*8282.23 2*8282.73 2*8283.63 8284.38 /

7 Comments

@Mike: please upload a sample file by clicking the paperclip button.
Why are you using the FOR loop?
Stephen,
Now that's a very interesting question. When you originally helped me with this, regexprep was passed the entire file at once, and the performance was not so good. For most of my data sets, only a small fraction of records used the "*" and each record that did only had a few replacements. Splitting the problem up with find('*') gave regexprep more but smaller records to work with, and performance improved. The current data set has many more repeats per record, and performance has dropped.
I'm going to follow the implicit lead from your question and construct several variations to see how each performs, and then I'll post what I learn for advice / improvement.
Mike
p.s. I hit the 10 MB limit for uploads with my existing data set. I'll generate a smaller example after making the code modifications.
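A side note on the loop discussed above, offered as a sketch rather than a diagnosis: if file is a column string array (as the 2145956x1 size quoted in the question suggests), find returns a column of indices, and a MATLAB for loop iterates over the columns of its range. In that case `for m=n` runs its body once, with m holding every index, so the loop is already a single vectorized regexprep call. Logical indexing expresses the same thing directly. The four-element file below is a made-up stand-in for the real data.

```matlab
% Hypothetical stand-in for the string array read from the file.
file = ["VAR"; "3294.03 2*3293.74 3292.93"; "3441.96 3468.29"; "2*8281.8 8284.38 /"];

% Same helper as in the question; it is called from the dynamic expression.
fun = @(n,c)repmat(sprintf(' %s ',c),1,str2double(n)); %#ok<NASGU>

% find() on a column array returns a column of indices, and "for m=n"
% walks over the columns of n, so the body runs exactly once.
n = find(contains(file,'*'));
nIter = 0;
for m = n
    nIter = nIter + 1;   % m is the whole index column here
end

% Equivalent call without the loop: touch only the records that contain '*'.
mask = contains(file,'*');
file(mask) = regexprep(file(mask),'\s*(\d+)\*(\S+)','${fun($1,$2)}');
```

If the array were a row instead, the same loop would run once per matching record, which is worth checking before comparing timings between variants.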
% Here is updated code with four different implementations, with the four measured values of dt
% included as comments. record.mat is attached with just one variable (ZCORN), so run times will be
% less than the dt's cited.
%
% Advice & suggestions are welcome!
%
% Mike
%
file=record;
% Expand repeats
fun = @(n,c)repmat(sprintf(' %s ',c),1,str2double(n));
% test regexprep implementations
file1=join(file);
file2=file;
file3=split(file1);
% Expand repeats
tic;
file1 = regexprep(file1,'\s*(\d+)\*(\S+)','${fun($1,$2)}');
dt=toc % dt=316.77
% Expand repeats
tic;
file = regexprep(file,'\s*(\d+)\*(\S+)','${fun($1,$2)}');
dt=toc % dt=305.92
% Expand repeats
tic;
n=find(contains(file2,'*'));
for m=n; file2(m) = regexprep(file2(m),'\s*(\d+)\*(\S+)','${fun($1,$2)}'); end
dt=toc % dt=316.62
% Expand repeats
tic;
n=find(contains(file3,'*'));
for m=n; file3(m) = regexprep(file3(m),'\s*(\d+)\*(\S+)','${fun($1,$2)}'); end
dt=toc % dt=401.44
"Can anyone (more experienced user!) suggest a faster method of parsing the data records?"
I have done a fair bit of playing around with dynamic regular expressions (e.g. words2num)... they are very useful, but not fast. I would recommend trying to avoid the dynamic function call, perhaps by somehow converting the REPMAT into either a pure regular expression (not dynamic) or pure MATLAB code.
Here is one approach:
str = '3294.03 2*3293.74 2*3293.45 2*3293.15 3292.93 3371.97 2*3376.36 2*3380.67 2*3384.95 2*3389.14 2*3393.22 2*3397.16 2*3400.97';
[T,S] = regexp(str,'\s*(\d+)\*(\S+)','tokens','split');
S = reshape(S,1,[]);
T = vertcat(T{:});
F = @(n,c)repmat(sprintf(' %s ',c{:}),1,n);
S(2,1:end-1) = arrayfun(F,str2double(T(:,1)),T(:,2),'uni',0); % FOR loop would be faster
out = sprintf('%s',S{:})
out = '3294.03 3293.74 3293.74 3293.45 3293.45 3293.15 3293.15 3292.93 3371.97 3376.36 3376.36 3380.67 3380.67 3384.95 3384.95 3389.14 3389.14 3393.22 3393.22 3397.16 3397.16 3400.97 3400.97 '
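As one more possible "pure MATLAB" variation on the idea above, here is a minimal sketch that skips the text expansion entirely. It assumes the end goal is the numeric values rather than the expanded string, that tokens are separated by whitespace, and that every repeat has the form N*X. None of this is confirmed in the thread, and it has not been timed on the real data. The variable names (tok, hasStar, cnt, val) are illustrative only.

```matlab
% Hypothetical sample record; the values come from the example in the question.
str = "3294.03 2*3293.74 2*3293.45 3292.93";

% Split on whitespace; strsplit collapses runs of spaces by default.
tok = string(strsplit(strtrim(char(str))));   % 1-by-K string array

% Tokens without '*' keep an implicit repeat count of 1.
hasStar = contains(tok,"*");
cnt = ones(size(tok));
val = tok;

% For "N*X" tokens, N is the text before the '*' and X is the text after it.
cnt(hasStar) = str2double(extractBefore(tok(hasStar),"*"));
val(hasStar) = extractAfter(tok(hasStar),"*");

% Convert to double once, then expand with repelem instead of building text.
out = repelem(str2double(val),cnt);
```

The trailing "/" terminator and bare "N*" tokens are not handled here (str2double would turn them into NaN), so they would need to be stripped or treated separately first.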
I will have a think about approaches using regular expressions.
Question: is there a limit to the value of n used in REPMAT? If so, what is that limit?
Q: is there a limit to the value of n used in REPMAT? If so, what is that limit?
A: The use case is based on parsing models provided by others, so there's no intrinsic limit to the number of records or to the number of repeats. The largest value for n that I reach routinely is somewhat in excess of 1 million. For my most problematic model, there are approximately 18 million pattern matches. BTW, your guidance is correct: the regexp time is not bad (considering), e.g.:
[startIndex,endIndex] = regexp(record,'(\d+)\*(\S+)');
dt=9.8 seconds. It's the subsequent parsing/expansion that takes the most time.
dt: 528.8494 <= Stephen's suggested approach
dt: 517.9332 <= Stephen's suggested approach with modified types
dt: 504.0396 <= Similar to Stephen's approach but using named tokens
dt: 313.5491 <= Brute force using regexp & extractBetween to extract substrings
dt: 273.4158 <= Original solution using regexprep
Above are the timings for my most problematic data set. Example code follows.
Recommendation to self: Looks like we're already close to the technical limit...
My thanks for the suggestion!
Mike
% Stephen's suggested approach
tic;
[T,S] = regexp(record,'(\d+)\*(\S+)','tokens','split');
S = reshape(S,1,[]);
T = vertcat(T{:});
F = @(n,c)repmat(sprintf(' %s ',c{:}),1,n);
% S(2,1:end-1) = arrayfun(F,str2double(T(:,1)),T(:,2),'uni',0); % FOR loop would be faster
% test0 = sprintf('%s',S);
R = arrayfun(F,str2double(T(:,1)),T(:,2),'uni',0);
test0 = strjoin(S,R);
dt=toc;
disp("dt: "+dt)
% Stephen's suggested approach with modified types
tic;
[T,S] = regexp(record,'(\d+)\*(\S+)','tokens','split');
R=strings(size(T));
for n=1:numel(R); R(1,n)=join(repmat(T{1,n}(2),1,str2double(T{1,n}(1)))); end
test1 = strjoin(S,R);
dt=toc;
disp("dt: "+dt)
% Similar to Stephen's approach but using named tokens
tic;
[N,S] = regexp(record,'(?<rep>\d+)\*(?<val>\S+)','names','split');
R=strings(size(N)); rep=zeros(size(R));
for n=1:numel(N); R(n)=N(n).val; rep(n)=N(n).rep; end
for n=1:numel(N); R(n)=join(repmat(R(n),rep(n),1)); end
test2 = strjoin(S,R);
dt=toc;
disp("dt: "+dt)
% Brute force using regexp & extractBetween to extract substrings
tic;
[startIndex,endIndex] = regexp(record,'(\d+)\*(\S+)');
nr = numel(endIndex);
S=strings(1,nr+1); R=strings(2,nr);
S(1) = extractBefore(record,startIndex(1));
S(end) = extractAfter(record,endIndex(end));
for n=2:nr; S(n)=extractBetween(record,endIndex(n-1)+1,startIndex(n)-1); end
for n=1:nr; R(:,n)=split(extractBetween(record,startIndex(n),endIndex(n)),"*"); end
reps=str2double(R(1,:)); R=R(2,:);
for n=1:nr; R(n)=join(repmat(R(n),reps(n),1)); end
test3 = strjoin(S,R);
dt=toc;
disp("dt: "+dt)
% Original solution using regexprep
tic;
fun = @(n,c)repmat(sprintf(' %s ',c),1,str2double(n)); %#ok<NASGU>
record = regexprep(record,'(\d+)\*(\S+)','${fun($1,$2)}');
dt=toc;
disp("dt: "+dt)
I wanted to thank Stephen23 again, and post the version of his solution that I now use routinely. There are two different repeat patterns in the string file records I'm parsing (N* in smaller records and N*X in both large and small records). The function solution works very well for both.
Mike
fun = @(n,c)repmat(sprintf('%s ',c),1,sscanf(n,"%u")); %#ok<NASGU>
if contains(record,"*")
record = regexprep(record,'(\d+)\*\s+','${fun($1,"1*")}');
record = regexprep(record,'(\d+)\*(\S+)\s+','${fun($1,$2)}');
end
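To make the two passes above easier to follow, here is a small trace on a made-up record that contains both repeat forms. It assumes, as the patterns require, that every repeat token is followed by whitespace. The record text and the names pass1 and pass2 are illustrative and not taken from the real files.

```matlab
% Hypothetical record with both repeat forms, each followed by whitespace.
record = "3294.03 2*3293.74 3* 3292.93 ";

% Same helper as above; it is called from the dynamic expressions.
fun = @(n,c)repmat(sprintf('%s ',c),1,sscanf(n,"%u")); %#ok<NASGU>

% Pass 1: a bare "N* " is rewritten as N copies of "1* ".
pass1 = regexprep(record,'(\d+)\*\s+','${fun($1,"1*")}');

% Pass 2: "N*X " is rewritten as N copies of "X ". The "1* " tokens from
% pass 1 are left alone because \S+ needs a non-space character after '*'.
pass2 = regexprep(pass1,'(\d+)\*(\S+)\s+','${fun($1,$2)}');

disp(pass1)
disp(pass2)
```

The order matters in this sketch: running the N*X pattern first would leave the bare N* tokens unexpanded, while running the N* pattern first leaves the N*X tokens untouched for the second pass.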


Answers (1)

Hi Mike,
I understand that your regexprep-based expansion of the repeat records is slow because of the large size of the data file.
Parsing large text files can be time-consuming, especially when using regular expressions. You can try the following approaches to speed up the process:
  • Try vectorization and reading the file record by record, instead of loading it all at once. Loading data into preallocated memory can also save some of the time that dynamic memory allocation at run time would consume. You can refer to the following sample code snippet to perform this operation:
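% Read every line of the file into a cell array of character vectors
fid = fopen('your_file.txt', 'r');
data = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
file = data{1};
% Preallocate the output, then expand one record at a time.
% Note: this assumes each record is "N X" (repeat count, space, value),
% not the N*X form used in the question.
expandedFile = cell(size(file));
for i = 1:numel(file)
    fields = strsplit(file{i}, ' ');
    expandedFile{i} = repmat(fields{2}, 1, str2double(fields{1}));
end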
  • Faster performance can be achieved by using the fread function in MATLAB to read the binary data directly from the file. This approach avoids the overhead of text parsing and can significantly improve the processing speed. The following code snippet demonstrates this:
fid = fopen('your_file.txt', 'r');
binaryData = fread(fid, Inf, 'uint8=>char')';
fclose(fid);
The fread function reads the entire file as binary data and stores it in the binaryData variable. The Inf argument specifies that it should read until the end of the file. The uint8=>char conversion is used to interpret the binary data as characters.
  • Parallel processing can also be considered in MATLAB to leverage multiple CPU cores and speed up the parsing process. A “parfor” loop can be used instead of a regular “for” loop to process multiple records in parallel. Please refer to the following MATLAB documentation link for more information on “parfor”: https://in.mathworks.com/help/parallel-computing/parfor.html
I hope this helps.

1 Comment

Thank you for these suggestions. (Sorry that I missed them ... I was travelling in August when your reply was posted.) I'm not sure that I understand how your suggestions actually help with my performance issue of finding and replacing strings of the form N*X with N copies of X. If you'd be kind enough to expand on your reply it would be appreciated.


Release: R2022a

Asked: 9 Jun 2023

Edited: 11 Apr 2024
