How do I parse mixed, dynamic binary & string files?

I'm having trouble parsing files that mix strings and numbers. The data I need is in 3 comma-delimited columns, but the lines aren't in any fixed form. A line can have 10 leading gibberish characters, it might have letters, symbols, or numbers, or it might just say "Error: here" (for example). I've tried textscan, strread, !strings, and fgetl with strread, but I can't seem to get what I need out into 3 variables [Col1 Col2 Col3]. I've been racking my brain. How can I do this? Here's a sample of the file:

@#*%;AJ))&3#* a) 24.568, 34.1024, -0.1023

&$@!*(!( (*&Y$)@ 24.568, 34.1020, -0.0888

()(@E$@!*(!( (*&Y$)@ 23.568, 34.1020, -0.0888

$64&$@!*(!( (*&Y$)@ 24.4568, 34.0020, -0.0888

Bad Command

$64&$@!*(!( (*&Y$)@ 24.4568, 34.0020, -0.0888

$64&$@!*(!( (*&Y$)@ 24.4568, 34.0020, -0.0888

&!)*~*(ER!( (*6&Y$)@ 24.568, 34.1020, -0.0888

(*!$)^@ 23.568, 34.1020, -0.0888

etc....

The closest I got was using something like:

fid = fopen('file.txt','r');
tline = fgetl(fid);
[c1 c2 c3] = strread(tline,'%f','delimiter',',');
fclose(fid);

but I can't iterate it, and it also quits if it reads a line containing a bad (non-floating-point) string.

  1 Comment
dpb on 17 Jan 2014
That's a {insert proverbial adjective here}...
It looks like the one consistent thing is that there's a blank preceding the first floating point value for the lines with valid data. I'd probably try to locate that on each line by a rear-to-front search for the second comma delimiter location and then the preceding blank prior to that, then try to convert that substring.
I suspect it'll take quite a lot of additional logic to handle all the other special cases you find.
Once, in a former life, I had the problem of processing large amounts of data returned from a power plant monitoring computer via punched paper tape that was always rife with mispunches and the like... it was a similar lot of work to write a reasonably robust processor to salvage them. From that experience, "good luck".
regexp may also be your friend here...
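A rough sketch of the regexp idea, for what it's worth. It assumes every valid line ends with three plain decimal numbers separated by commas and preceded by a blank, as in the sample; the variable names (lines, num, pat, vals, good) are only illustrative, and the lines are hard-coded here instead of being read from the file.

```matlab
% Sample lines copied from the question; in practice they would come from
% fgetl in a loop or from textscan(fid, '%s', 'Delimiter', '\n').
lines = { ...
    '@#*%;AJ))&3#* a) 24.568, 34.1024, -0.1023', ...
    'Bad Command', ...
    '(*!$)^@ 23.568, 34.1020, -0.0888'};

% One number: optional sign, digits, optional fraction (assumed format;
% exponents such as 1e-3 are not handled).
num = '[-+]?\d+\.?\d*';
% Three such numbers, comma-separated, anchored at the end of the line and
% preceded by whitespace.
pat = ['\s(' num ')\s*,\s*(' num ')\s*,\s*(' num ')\s*$'];

vals = nan(numel(lines), 3);          % NaN marks lines with no valid data
for k = 1:numel(lines)
    tok = regexp(lines{k}, pat, 'tokens', 'once');
    if ~isempty(tok)
        vals(k, :) = str2double(tok); % tok is a 1x3 cell of strings
    end
end

good = ~any(isnan(vals), 2);          % keep only rows that parsed
Col1 = vals(good, 1);
Col2 = vals(good, 2);
Col3 = vals(good, 3);
```

Lines such as "Bad Command" simply fail to match and are dropped, so nothing quits partway through the file.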


Accepted Answer

Walter Roberson on 17 Jan 2014
fid = fopen('file.txt','r');
datacell = textscan(fid, '%s%s%s', 'Delimiter', ',');
fclose(fid);
col1s = regexprep( datacell{1}, '^.*\s', '' );
Col1 = str2double(col1s);
Col2 = str2double(datacell{2});
Col3 = str2double(datacell{3});
There will be NaN in any entry that did not match the proper format.
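If the NaN rows are not wanted, one possible follow-up is to drop every row where any of the three columns failed to convert. This is only a sketch: raw1, raw2 and raw3 are made-up stand-ins for the three cell arrays of strings, not output from the actual file.

```matlab
% Hypothetical stand-ins for the three columns of strings (column 1 already
% stripped of its leading gibberish).
raw1 = {'24.568'; 'Bad Command'; '23.568'};
raw2 = {' 34.1024'; ''; ' 34.1020'};
raw3 = {' -0.1023'; ''; ' -0.0888'};

% str2double returns NaN for anything that is not a single valid number,
% including empty strings.
Col1 = str2double(raw1);
Col2 = str2double(raw2);
Col3 = str2double(raw3);

% Keep only the rows where all three columns converted.
valid = ~(isnan(Col1) | isnan(Col2) | isnan(Col3));
Col1 = Col1(valid);
Col2 = Col2(valid);
Col3 = Col3(valid);
```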
  2 Comments
Tom W on 23 Jan 2014
Kudos! I was able to use another method; however, your suggestion worked in far fewer lines of code than what I was using. To learn, I don't quite understand the '^.*\s' syntax. What does it tell the program to do? I'm wondering if that syntax would be of use elsewhere if I understood it better. Thanks again!
Walter Roberson on 23 Jan 2014
'^.*\s' is a regular expression, which is a pattern that needs to be matched. The '^' means that the match must start at the beginning of the string. The . means to match any one character. The * modifier after the . means to extend the previous specification (the dot) as far as possible to the right such that the rest of the pattern afterwards is still satisfied, so it gobbles as many characters as it can such that the rest still works. The \s means any one whitespace character (such as a blank). Re-interpreting this, it means to start at the beginning, find the last whitespace character in the string, and take everything from the beginning up to and including it.
This pattern is inside a regexprep() call, which says to replace the matched string with what is described in the next argument. The next argument I gave is '' which is the empty string. So the effect is to delete all characters from the beginning of the string up to and including the final space, leaving the last series of non-blank characters alone. In other words, within the first comma-delimited field, it cuts out everything except the last blank-separated item, which is the first number on a line with valid data.
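A small illustration of the pattern in action, as a sketch only. The strings in field1 are hand-made examples of what the first comma-delimited field might look like for lines in the sample, and the variable names are arbitrary.

```matlab
% Hand-made examples of the first comma-delimited field.
field1 = {'@#*%;AJ))&3#* a) 24.568'; '(*!$)^@ 23.568'; 'Bad Command'};

% Delete everything from the start of each string through its last
% whitespace character.
stripped = regexprep(field1, '^.*\s', '');
% stripped is {'24.568'; '23.568'; 'Command'}

% Whatever is left that is not a number becomes NaN.
nums = str2double(stripped);
% nums is [24.568; 23.568; NaN]
```

Note that 'Bad Command' is reduced to 'Command', which str2double then turns into NaN.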
