Textscan encountering unwanted character. How do I kill that line and move on without killing the script??

Question

ndb on 27 Mar 2019

Edited: ndb on 4 Apr 2019

Accepted Answer: Walter Roberson

ndb_sample.txt

Hello:

I'm processing a large temporal dataset (data recorded every minute with 60+ columns). Right now, I'm using textscan() to parse it. About five times per file, an upside-down question mark (¿) shows up within the data. This kills my script because textscan() expects a float there. I'd like to avoid that by skipping the field where the character is found, along with the remaining columns on that line, and treating them all as empty. I've attached a few minutes of the data that include good data and one line with the ¿. Here's a bit of ugly code within a loop that deals with that:

filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, ...
    'TextType', 'string', 'ReturnOnError', false,'EmptyValue',-Inf);
fclose(fileID);

I know it's probably not the most efficient way to do it, but it's what I've got now. I've looked a bit into regular expression replacement, but I could never get that to work. Any advice is appreciated.

dpb on 28 Mar 2019

How big is the actual file?

With today's memory, I'd be tempted to just load the whole thing in memory and clean up the offending lines, then process.

Or, since it's surprisingly fast, just write a quick filter that kills any line it finds with the bum character...or use a standalone grep utility first...

magicchar=char(N);    % whatever the offending character is
fidi=fopen('yourfile.txt','r');
fido=fopen('newfile.txt','w');
while ~feof(fidi)
  l=fgets(fidi);                          % fgets keeps the line terminator
  if contains(l,magicchar),continue,end   % skip any line with the bad character
  fprintf(fido,'%s',l);                   % copy clean lines through unchanged
end
fclose(fidi);
fclose(fido);
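
The "load the whole thing in memory" idea mentioned above could look roughly like the following. This is only a sketch: the three-column sample text stands in for fileread('yourfile.txt'), and badChar = char(0) is an assumption, so substitute whatever character actually appears in the file.

```matlab
% Sketch: drop every line that contains the offending character, in memory.
badChar = char(0);                         % assumption: replace with the real bad character
% Made-up stand-in for S = fileread('yourfile.txt');
S = ['1 2 3' newline '4 ' badChar ' 6' newline '7 8 9' newline];
lines  = strsplit(S, newline);             % one cell per line
keep   = ~contains(lines, badChar);        % logical mask of clean lines
cleanS = strjoin(lines(keep), newline);    % reassemble the surviving lines
C = textscan(cleanS, '%f%f%f');            % textscan can parse text directly
```

Because the cleaned text is parsed straight from memory, no intermediate file is needed. Note that contains() and newline require R2016b or later.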

ndb on 28 Mar 2019

Edited: ndb on 28 Mar 2019

The files aren't big at all, ~400 KB. The thing is, I'd like to keep the data up to the point where the offending character lives, and then kill everything after that... if possible.

You're right though: I should be reading in the whole file, doing my magic on it, and then writing it out somewhere else. I wasn't doing that to begin with because, up to a certain date, the files had an even wonkier format that wasn't uniform. Now that the files coming in are uniform, this will definitely be the way to go. Thanks also for the shortcut on the line formatting. That will help in all my other endeavours as well.

Answer 1

Walter Roberson on 28 Mar 2019

Edited: Walter Roberson on 28 Mar 2019

filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
ncol = 72;
filler = repmat('-inf ', 1, ncol);
S = fileread(filename);
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
dataArray = textscan(newS, formatSpec, 'Delimiter', delimiter, 'TextType', 'string');

You already have a %[^\n\r] to eat to the end of the line. Typically that will get "0 0 " in it (that is, if you were trying to read all the numeric columns as numeric, then your count was off by 2). I take advantage of that by detecting the bad character and substituting a full line's worth of the -inf pattern, knowing that each -inf will be read by a %f format and that any left-over -inf will be eaten by the %[^\n\r] pattern. For those lines, the corresponding dataArray{end} entry will hold a number of "-inf" occurrences. I figure that if the 0 0 was significant for something, you would have read it with %f%f.

Walter Roberson on 29 Mar 2019

Sure.

\S is a non-space character. * means "any number of them", including none at all. This part is expressing "match all of the non-space characters that are immediately before the bad character", wiping out the field that the bad character appears in, as requested. (It would have also been valid to wipe out only from the bad character onwards. And looking at your data, you could have considered splitting the records at the bad character, as it looks like the bad character replaces the end of a record and a new line starts immediately after it.)

Then the \x means that the next two digits are to be interpreted as the hexadecimal representation of the character. So \x00 means char( hex2dec('00')) which is char(0) . The bad character was the null character when I looked in the file.

Then . means "match any character", ordinarily including possibly newline characters. The * means "any number of them". Normally that would extend as far as possible, matching any number of any character, going as far as you can in the file until forced to backtrack to match something else that followed in the regular expression. However, the ? means to instead use as few characters as is needed to match the rest of the expression.

The $ then normally matches the end of the string. However, with the 'lineanchors' option, it matches the end of a line instead. Following .*? , it effectively changes the meaning from "match everything until end of file" to "match everything until the end of the current line".

Thus, the \x00.*?$ with the 'lineanchors' option means to match from a null character to the end of the same line. And the \S* before that goes back to the beginning of the field that the null was found in.

The 'lineanchors' option is important in this context. It would not hurt to also use the 'dotexceptnewline' option to reinforce that the .*? is not to go past the end of the current line.
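
As a minimal sketch of the pattern described above, the following runs it on a made-up three-line string, once with 'lineanchors' alone and once with 'dotexceptnewline' added. The sample values and the placeholder replacement 'X' are assumptions for illustration and are not taken from the attached file.

```matlab
nul = char(0);                                   % the null character, i.e. \x00
% Made-up sample: the second line has a null inside its second field.
S = ['1.0 2.0 3.0' newline ...
     '4.0 5.' nul '1 6.0' newline ...
     '7.0 8.0 9.0'];
pat  = '\S*\x00.*?$';
outA = regexprep(S, pat, 'X', 'lineanchors');
outB = regexprep(S, pat, 'X', 'lineanchors', 'dotexceptnewline');
% Only the damaged field and the rest of its line should be replaced.
expected = ['1.0 2.0 3.0' newline '4.0 X' newline '7.0 8.0 9.0'];
fprintf('lineanchors only:      %d\n', isequal(outA, expected));
fprintf('with dotexceptnewline: %d\n', isequal(outB, expected));
```

On this sample the two calls should agree, since the lazy .*? stops at the first line end either way; the extra option just makes that intent explicit.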

regexprep(string, pattern, replacement) finds all occurrences of the pattern in the string, and replaces them with the replacement. Thus we are searching for all cases in which a null occurs on a line and replacing from the beginning of that field until the end of that line.

What we are replacing with is 72 copies of '-inf '. -inf is what you had previously configured at your EmptyValue option for textscan, so in cases where it found emptiness in a field you wanted -inf returned for that field; that's where the '-inf' comes from.

The code assumes that the null could appear anywhere on the line. If we knew that the null only appeared at one particular location, we could replace it with a run of -inf just long enough for that one case, but the code makes no such assumption. It allows for the null being even in the first field, in which case all 72 copies of -inf would be needed to fill the fields. The code does not try to figure out which field number it is working in, so it does not work out how many fields need to be replaced with -inf on a particular line. Instead it just puts in all the -inf it could possibly need, and counts on the %[^\n\r] catch-all at the end of the textscan pattern to consume any -inf that were not needed.
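
To make the "over-fill and let the catch-all eat the rest" idea concrete, here is a scaled-down sketch with 3 numeric columns instead of 72. The sample lines, the single leading text column, and the trailing "0 0" are made up to mimic the general layout; they are assumptions, not the real file.

```matlab
nul  = char(0);
ncol = 3;                                    % numeric columns in this toy example
% Made-up sample: the null sits in the second numeric field of line 2.
S = ['a 1 2 3 0 0' newline ...
     'b 4 ' nul '5 6 0 0' newline ...
     'c 7 8 9 0 0' newline];
filler = repmat('-inf ', 1, ncol);           % always enough -inf for a whole line
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
% %*s skips the text column; %[^\n\r] swallows whatever is left on each line.
C = textscan(newS, '%*s%f%f%f%[^\n\r]', 'Delimiter', ' ');
disp([C{1:3}])                               % numeric columns side by side
```

The good field before the null (the 4) is kept, the fields from the null onwards come back as -Inf, and the surplus -inf text ends up in the last, catch-all column.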

Walter Roberson on 29 Mar 2019

Ah, I see it now, the 13.22.282 . Unfortunately, textscan is happy to treat that as 13.22 0.282 without noticing anything wrong. So yes, a fair bit would have to be known about the correct representation of numbers on the system. For example it helps to know for sure that it always puts leading 0. on valid fractional values < 1: some systems would instead leave out the leading 0 and go directly to the period, '0.282' versus '.282' .

Are the numbers certain to have 3 decimal places? And is it certain that a positive number will always have a single space after the comma but a negative value will have no space after the comma?
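
Since textscan will not complain about a token like 13.22.282, one option is to flag such lines before parsing. The sketch below is only that: the two sample lines are made up, and the rule "a number has at most one decimal point" is an assumption about the data format that would need confirming.

```matlab
% Made-up sample lines; the second contains a token with two decimal points.
lines = {'12.500, 13.220, 0.282', ...
         '12.500, 13.22.282, 0.100'};
% digit, '.', digits, '.', digit can only occur inside a malformed number
firstHit  = regexp(lines, '\d\.\d+\.\d', 'once');
isSuspect = ~cellfun(@isempty, firstHit);    % true where a line looks damaged
disp(lines(isSuspect))
```

Flagged lines could then be dropped or repaired before being handed to textscan.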

ndb on 4 Apr 2019

Edited: ndb on 4 Apr 2019

Just to close this out: I ended up incorporating readtable() (new to 2019a) into my workflow instead of textscan(). I decided to read in each file, play with it, and write out a file of filtered and processed data. While it's a bit slower, readtable() allowed me to deal with the data in a slightly cleaner way. I also ended up killing each line of data that had offending characters, an insufficient number of columns or variables, etc. While this exercise has allowed me to work on my regular expressions, I decided that I don't have time to deal with every little exception and error. For what it's worth, here's the bit of code that uses readtable() to read in my data:

fileList = dir('*.Neph.txt');
varNames = {'Year','Month','Day','Hour','Minute','Sec','Mtrash',...
    'Dtrash','ytrash','Htrash','Mintrash','Sectrash','nm635','nm525',...
    'nm450','back635nm','back525nm','back450nm','SampleTemp',...
    'EnclosureTemp','RH','Pressure','MajorState','DIOState'};
varTypes = {'double','double','double','double','double','double',...
    'double','double','double','double','double','double','double',...
    'double','double','double','double','double','double','double',...
    'double','double','char','char'};
delimiter = {' ','\t',',','/',':','-'};
dataStartLine = 1;
opts = delimitedTextImportOptions('VariableNames',varNames,...
    'VariableTypes',varTypes,...
    'Delimiter',delimiter,...
    'DataLines', dataStartLine,...
    'ConsecutiveDelimitersRule','join',...
    'MissingRule','omitrow',...
    'EmptyLineRule','skip',...
    'ImportErrorRule','omitrow',...
    'ExtraColumnsRule','ignore');
for i = 1:length(fileList)
    if fileList(i).bytes ~= 0
        %read in data
        data = readtable(fileList(i).name,opts);
.
.
.

Thanks to Walter and dpb for the help and suggestions. Cheers
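
For anyone wanting to see the 'omitrow' behaviour in isolation, here is a small self-contained sketch. The temporary file, its three columns, and the variable names a, b, c are all made up for illustration; only the option names come from the code above.

```matlab
% Write a tiny made-up file whose second row has a non-numeric field.
tmp = [tempname '.txt'];
fid = fopen(tmp, 'w');
fprintf(fid, '1 2 3\n4 x 6\n7 8 9\n');
fclose(fid);

opts = delimitedTextImportOptions('VariableNames', {'a','b','c'}, ...
    'VariableTypes', {'double','double','double'}, ...
    'Delimiter', ' ', ...
    'DataLines', 1, ...
    'MissingRule', 'omitrow', ...
    'ImportErrorRule', 'omitrow');   % drop rows that fail numeric conversion
T = readtable(tmp, opts);
delete(tmp);                         % clean up the temporary file
disp(T)
```

With these rules the row containing 'x' should be left out of T rather than stopping the import.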
