MATLAB Coder regexp Alternative

Christopher McNamara on 23 Mar 2019
Answered: Guillaume on 27 Mar 2019
Hello,
I am attempting to use MATLAB Coder to convert a function I have for parsing large text files for relevant data. I recently posted a related question regarding the size of these files: https://www.mathworks.com/matlabcentral/answers/448915-large-text-file-datastore. Although the datastore option wasn't the best way for me to parse my data, I was able to successfully write a function to read the large (~30 GB+) text files and reduce the data out of them.
My reason for using Coder is the hope of speeding up the function. The function works by reading the text file in blocks, but it must loop through each block a number of times looking for relevant data. Because of the data format and my lack of control over it, this is the only way I can see to parse the file. Additionally, the function cannot be effectively vectorized. Since it relies on multiple for loops instead, I expect codegen to give a speedup.
The problem I have now is converting the function to a form that codegen can use. All of the issues reported so far by coder.screener( ) have been fairly easy to fix, with the exception of my usage of regexp( ). I need this function because my text file blocks contain a lot of "garbage" information and I only need to extract certain expressions from each block. For example, I might have a page of data that looks similar to this:
Garbage garbage garbage = 2345 garbage lsasdfasdf
adfasdfasdfasdffas klasdfklnfa asdfasdflkasdf lkasdf
Relevant Data
X 1 = 20.100 X 2 = 30.200 X 3 = 40.100 .....
....
X ij1 = 12.012 X ij2 = 210.20 X ijk = 1000.1
Garbage garbage garbage = 2345 garbage lsasdfasdf
adfasdfasdfasdffas klasdfklnfa asdfasdflkasdf lkasdf
My way of handling this has been to use regexp like this:
load PageData.mat
ValueExpr = '\d+=(\S|\s+)\S+\s';
DataBlocks = regexp(PageData,ValueExpr,'match');
This piece of code returns, in a cell array, all of the data in the format "X 1 = 20.100" for all of the indices from 1 to ijk. I can then use strsplit to get the ijk values and measurement values.
This is the crux of my problem. Although I use regexp in many places in the function for other tasks (splitting, trimming, etc.), the usage shown above is the backbone of the data extraction routine, and I am unsure how to reproduce its behavior with other Coder-compatible functions. My best attempt would be to use something like strfind( ) to gather all the indices of "=" and loop through these indices to get the data, but I am not sure whether there is a simpler way to get this into a Coder-compatible form.
Any ideas would be greatly appreciated.
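The strfind-style idea described above (locate every "=" and read the token on each side of it) needs no regular expressions. Here is a minimal sketch of that scan, written in C++ because the thread later moves to C++; the helper name scan_pairs and the whitespace-delimited token rule are assumptions made for illustration, not part of the original function:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for every '=' in text, collect the whitespace-delimited
// token to its left and the one to its right as a (left, right) pair.
std::vector<std::pair<std::string, std::string>> scan_pairs(const std::string& text)
{
    std::vector<std::pair<std::string, std::string>> pairs;
    auto is_space = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    std::size_t pos = text.find('=');
    while (pos != std::string::npos) {
        // Walk left over whitespace, then over the token before '='.
        std::size_t left_end = pos;
        while (left_end > 0 && is_space(text[left_end - 1])) --left_end;
        std::size_t left_begin = left_end;
        while (left_begin > 0 && !is_space(text[left_begin - 1])) --left_begin;
        // Walk right over whitespace, then over the token after '='.
        std::size_t right_begin = pos + 1;
        while (right_begin < text.size() && is_space(text[right_begin])) ++right_begin;
        std::size_t right_end = right_begin;
        while (right_end < text.size() && !is_space(text[right_end])) ++right_end;
        // Keep the pair only when both tokens are non-empty.
        if (left_end > left_begin && right_end > right_begin) {
            pairs.emplace_back(text.substr(left_begin, left_end - left_begin),
                               text.substr(right_begin, right_end - right_begin));
        }
        pos = text.find('=', pos + 1);
    }
    return pairs;
}
```

Note that a garbage line such as "garbage = 2345" also produces a pair, so the caller would still need to filter, for example by keeping only pairs whose left token is numeric.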
  4 Comments
Christopher McNamara on 27 Mar 2019
I am using R2019a.
I was able to "cleanse" my code of regexp( ) by using some workarounds and converting to support functions, but the calls I have not been able to remove are those that match expression formats, basically regexp(str, expr, 'match'). I figured I could get around this by writing a C++ function that I could call using coder.ceval.
I wrote the following C++ function, which I have tested and which seems to do what I want (mimic the behavior of regexp(str,expr,'match')):
#include <iostream>
#include <string>
#include <regex>
#include <vector>
using namespace std;
// Define function to call regex_search to match input variable and expression type
// Outputs a string vector -- std::vector <string>
// Accepts a string to search and string that defines the expression type to match
std::vector <string> regex_var(std::string str, std::string str_expr)
{
// Declare match parameter
std::smatch m;
// Declare regex expression using user supplied expression
std::regex expr(str_expr);
// Initialize match string
std::string str_match;
// Initialize output vector
vector <string> match_strings;
// Loop through and collect every match in a vector until nothing is found.
// The search in the loop condition already fills m, so no second call is needed.
while (std::regex_search(str, m, expr))
{
// Define output
str_match = m.str();
// Update vector with matches
match_strings.push_back(str_match);
// Redefine string to eliminate previously searched
str = m.suffix();
}
// Return output
return match_strings;
}
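For comparison, the collection loop in regex_var can also be written with std::sregex_iterator, which walks over successive non-overlapping matches without re-assigning the input string on every pass. This is only a sketch of an equivalent formulation; the name regex_matches is hypothetical and is not used anywhere else in this thread:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical alternative to regex_var: collect every full match of
// str_expr found in str, in order of appearance.
std::vector<std::string> regex_matches(const std::string& str, const std::string& str_expr)
{
    const std::regex expr(str_expr);
    std::vector<std::string> match_strings;
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(str.begin(), str.end(), expr), end; it != end; ++it) {
        match_strings.push_back(it->str());
    }
    return match_strings;
}
```

Taking the arguments by const reference also avoids copying the input, which may matter when the blocks being searched are large.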
When I try to call regex_var through coder.ceval, I cannot get it to compile: it fails during MEX generation. The MATLAB function I am using to call it is:
function match = regex_match(str,expr) %#codegen
match = '';
coder.updateBuildInfo('addSourceFiles','regex_var.cpp');
match = coder.ceval('regex_var',str,expr);
end
I wrote this by following some of the basic examples. However, it fails for a variety of reasons.
Firstly, I am struggling to figure out how to receive the string output in the variable match, because I keep getting the error "C functions always return scalar values".
If I suppress the assignment to match, just to see whether the compiler will finish without error, I get the additional error "implicit declaration of function 'regex_var'". I interpret this as the compiler having an issue with the declaration of my C++ routine regex_var(), but I don't understand why, because this code successfully compiles and works in C++.
Thanks in advance for any more help you can provide.


Answers (2)

Guillaume on 25 Mar 2019
Like Walter, I was going to suggest delegating to another regular expression engine. Note that in modern C++ (C++11 and later), there is no longer any need for boost::regex; it is now part of the standard as std::regex.
However, you already have access to one or two other regular expression engines directly from Matlab.
You always have access to the Java regular expression engine:
pattern = java.util.regex.Pattern.compile('\d+=(\S|\s+)\S+\s');
matcher = pattern.matcher(java.lang.String('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage'));
matches = {};
while matcher.find
matches = [matches; char(matcher.group)];
end
celldisp(matches)
On Windows, you also have access to the .NET regular expression engine:
%in theory you should be able to create a regular expression:
%regex = System.Text.RegularExpressions.Regex('\d+=(\S|\s+)\S+\s');
%and get a matchcollection
%matchcollection = regex.Matches('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage');
%but i get a 'no method with match signature' error that I don't understand right now
%Can use static Matches method instead:
matchcollection = System.Text.RegularExpressions.Regex.Matches('garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage', '\d+=(\S|\s+)\S+\s');
matches = arrayfun(@(m) char(matchcollection.Item(m).Value), 0:matchcollection.Count-1, 'UniformOutput', false);
celldisp(matches)
  3 Comments
Christopher McNamara on 25 Mar 2019
I wasn't able to find anything but running the above through the coder interface fails on the usage of the Java or .NET commands. Although it is nice to know that regex is in the base for modern C.
Guillaume on 25 Mar 2019
@Walter, yes, I was thinking about that on my way home, and it's probably not supported. (I don't have coder, so I don't know.) However, if you're going to call a C++ function, you could still delegate to .NET.
@Christopher, as far as I know regular expressions are not part of C. They are part of C++ (different languages despite some similarity). Here is an example of using std::regex. Bear in mind that I wrote that code back in 2013 when C++11 was new and have hardly written any C++ since then, so there may be more modern ways of doing it nowadays.
#include <iostream>
#include <regex>
#include <string>
int main(){
    // Raw string literal, so the backslashes reach the regex engine unchanged
    const char* pattern = R"(\d+=(\S|\s+)\S+\s)"; //nowadays you'd probably use std::string
    const char* search = "garbage X 1=20.100 X 2=30.200 X 3=40.100 X 123=12.012 garbage";
    const std::regex re(pattern);
    std::string matched_text;
    std::cmatch result;
    if (std::regex_search(search, result, re)){
        matched_text = result.str(); //str() or str(0) is the full match.
        std::cout << matched_text << '\n';
    }
    //...
}
Note that each expression engine may behave slightly differently. For example, Matlab's regex is the only engine I've come across where . also matches a \n by default. Your regular expression is sufficiently basic that it should behave the same in all engines.
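As a small illustration of that kind of engine difference, std::regex with its default ECMAScript grammar does not let . match a newline. The helper name dot_matches below is made up for this sketch:

```cpp
#include <regex>
#include <string>

// Returns true when the whole of text matches "a.b", i.e. when '.' accepts
// the middle character under the default ECMAScript grammar.
bool dot_matches(const std::string& text)
{
    static const std::regex re("a.b");
    return std::regex_match(text, re);
}
```

With this helper, dot_matches("axb") is true while dot_matches("a\nb") is false, so a pattern that relies on . spanning line breaks in Matlab would need adjusting before being moved to std::regex.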
I'm not entirely sure what you're trying to match with your regex. It doesn't match anything in your example text. The alternation probably slows the regular expression (as it will force backtracking).
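The mismatch is easy to check: the pattern requires a digit immediately followed by "=", while the example page has spaces around the "=". The sketch below counts matches for the pattern from the question and for a space-tolerant variant; that variant (spaced_pattern) and the helper name count_matches are assumptions made for illustration, not something proposed in the question:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Count the non-overlapping matches of pattern in text.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    return static_cast<std::size_t>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                      std::sregex_iterator()));
}

// The pattern from the question, and a hypothetical variant that allows
// optional whitespace on both sides of '='.
const std::string original_pattern = R"(\d+=(\S|\s+)\S+\s)";
const std::string spaced_pattern   = R"(\d+\s*=\s*\S+)";
```

On a line like "X 1 = 20.100 X 2 = 30.200 X 3 = 40.100 " the original pattern finds nothing, while the spaced variant finds three matches. The variant also has no alternation, so it avoids the backtracking concern mentioned above.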
But at the end of the day, if you're parsing 50 GB of text data, it will take time regardless of which language it's in. That's a lot of text!



Guillaume on 27 Mar 2019
I know nothing about coder and don't have the toolbox. Reading the documentation of coder.ceval, I find it very incomplete. It mentions that you can call C and C++ functions but only shows examples of calling C functions, not C++ functions, so it's really not clear whether it supports calling pure C++.
Despite the similarity, C and C++ are two different languages. C++ functions use a different calling convention and support many more types as inputs and outputs than C. You can make a C++ function compatible with and callable from C by prefixing it with extern "C"; however, you're then limited to C return values, which are basically just standard integer or floating point types or plain pointers (to structs, arrays, etc.).
Considering that I haven't seen a single example of C++ in the various coder documentation pages that I've looked at, I think that their constant use of "C/C++" is misleading: it looks like coder only understands C interfaces. So you would have to replace your std::vector inputs and outputs with pointers (and, looking at the doc, wrap these in coder.ref or coder.rref for inputs and coder.wref for outputs). Your function would then become:
void regex_var(const char* str, const char* pattern, char** matches){ //does matlab use char or wchar?
//can use std::vector within the function but in-out must be C
//...
}
and in matlab it would be something like:
coder.ceval('regex_var', coder.rref(str), coder.rref(pattern), coder.wref(matches));
What I can't figure out from the doc is how you manage the allocation of matches. All examples show preallocation of the output in matlab, which of course doesn't work if you don't know how many matches you're going to get, or the length of each match. It looks to me that coder does not support functions that return variable size arrays.
Again, I know nothing about coder, so take all this with a grain of salt.
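One possible way around the allocation question, sketched under assumptions: let the caller preallocate a single flat char buffer and have the function write all matches into it, separated by newlines, returning the match count. The name regex_var_c, the newline separator, and the -1 error convention are all hypothetical choices for this sketch and do not come from the coder documentation:

```cpp
#include <cstddef>
#include <cstring>
#include <regex>
#include <string>

// Hypothetical C-callable wrapper. Writes every match into out, each followed
// by '\n', then NUL-terminates. Returns the number of matches, or -1 if the
// buffer is too small or the pattern is invalid.
extern "C" int regex_var_c(const char* str, const char* pattern, char* out, int out_capacity)
{
    if (str == nullptr || pattern == nullptr || out == nullptr || out_capacity <= 0) return -1;
    try {
        const std::regex re(pattern);
        std::string joined;
        int count = 0;
        for (std::cregex_iterator it(str, str + std::strlen(str), re), end; it != end; ++it) {
            joined += it->str();
            joined += '\n';
            ++count;
        }
        // +1 for the terminating NUL.
        if (joined.size() + 1 > static_cast<std::size_t>(out_capacity)) return -1;
        std::memcpy(out, joined.c_str(), joined.size() + 1);
        return count;
    } catch (const std::regex_error&) {
        return -1; // exceptions must not cross the C boundary
    }
}
```

On the matlab side the buffer would presumably be preallocated to some upper bound (for instance the length of the input block plus room for separators) and passed with coder.wref, then split on the separator afterwards. A match that itself contains a newline would break this scheme, so a different separator or a length-prefixed layout may be needed for patterns that use \s.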
