textscan encounters an unwanted character. How do I skip that line and move on without killing the script?

Hello:
I'm processing a large temporal dataset (data recorded every minute with 60+ columns). Right now, I'm using textscan() to parse it. Maybe 5 times throughout one file, there is an upside-down question mark (¿) within the data. This kills my script, because textscan expects a float there. I'd like to avoid that by skipping the column where the character is found, along with the remaining columns on that line, and treating them all as empty. I've attached a few minutes of data that include good lines and one line with the ¿. Here's a bit of ugly code within a loop that deals with that:
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, ...
'TextType', 'string', 'ReturnOnError', false,'EmptyValue',-Inf);
fclose(fileID);
I know it's probably not the most efficient way to do it, but it's what I've got now. I've looked a bit into regular expression replacement, but I could never get that to work. Any advice is appreciated.
  4 Comments
dpb on 28 Mar 2019
How big is the actual file?
With today's memory, I'd be tempted to just load the whole thing in memory and clean up the offending lines, then process.
Or, it's surprisingly fast, just write a quick filter that kills any line it finds with the bum character...or use a standalone grep utility first...
magicchar = char(N); % whatever the offending character is
fidi = fopen('yourfile.txt','r');
fido = fopen('newfile.txt','w');
while ~feof(fidi)
    l = fgets(fidi);                         % read one line, keeping its line terminator
    if contains(l,magicchar), continue, end  % skip any line holding the bad character
    fprintf(fido,'%s',l);                    % copy clean lines through unchanged
end
fclose(fidi);
fclose(fido);
ndb on 28 Mar 2019
Edited: ndb on 28 Mar 2019
The files aren't big at all, ~400 KB. The thing is, I'd like to keep the data up to the point where the offending character lives, and then kill everything after that... if possible.
You're right though: I should be reading in the whole file, doing my magic on it, and then writing it out somewhere else. I wasn't doing that to begin with because, up to a certain date, the files had an even wonkier format that wasn't uniform. Now that the files coming in are uniform, this will definitely be the way to go. Thanks also for the shortcut on the line formatting. That will help in all my other endeavours as well.
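A minimal sketch of the "keep the data up to the offending character, then kill everything after it" idea, working line by line without textscan. Everything here is illustrative: the sample lines, the column count `ncol`, and the assumption that the offending character is `char(191)` (¿ in Latin-1) are not taken from the real files.

```matlab
badChar = char(191);   % assumption: the offending character is ¿ (Latin-1 code 191)
ncol    = 3;           % hypothetical number of numeric columns
lines   = {'1.0 2.0 3.0', ['4.0 ' badChar '5.0 6.0'], '7.0 8.0 9.0'};  % made-up sample

data = NaN(numel(lines), ncol);              % anything not read stays NaN
for k = 1:numel(lines)
    tokens = strsplit(strtrim(lines{k}), ' ');
    bad = find(contains(tokens, badChar), 1);  % first token holding the bad character
    if ~isempty(bad)
        tokens = tokens(1:bad-1);              % keep only the tokens before it
    end
    vals = str2double(tokens);
    n = min(numel(vals), ncol);
    data(k, 1:n) = vals(1:n);
end
```

With this sample, the second row keeps its first value and the remaining columns are left as NaN. For real files the lines could come from `fileread` followed by a split on the line terminator.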


Accepted Answer

Walter Roberson on 28 Mar 2019
Edited: Walter Roberson on 28 Mar 2019
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
ncol = 72;
filler = repmat('-inf ', 1, ncol);
S = fileread(filename);
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
dataArray = textscan(newS, formatSpec, 'Delimiter', delimiter, 'TextType', 'string');
You already have a %[^\n\r] to eat to the end of line. Typically that will get "0 0 " in it (that is, if you were trying to read all the numeric columns as numeric, then your count was off by 2). I take advantage of that eating by detecting the bad characters and substituting an entire line's worth of -inf, knowing that the -inf will be consumed by the %f formats and that any left-over -inf will be eaten by the %[^\n\r] pattern. On the affected lines, dataArray{end} will contain a number of "-inf" occurrences. I figure that if the 0 0 was significant for something, you would have read it with %f%f.
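To see what the substitution step does on its own, here is a small sketch on made-up text. It assumes the offending character is `char(191)` (¿) rather than the escape used in the pattern above, and it uses three filler columns instead of 72; the sample lines and variable names are hypothetical.

```matlab
badChar = char(191);                     % assumption: the offending character is ¿
ncol    = 3;                             % hypothetical column count for the demo
filler  = repmat('-inf ', 1, ncol);      % one line's worth of -inf tokens

% Made-up three-line sample; the second line holds the bad character
S = ['t 1.0 2.0 3.0' newline ...
     't 4.0 ' badChar '5.0 6.0' newline ...
     't 7.0 8.0 9.0'];

% Replace from the token containing badChar through the end of that line
newS  = regexprep(S, ['\S*' badChar '.*?$'], filler, 'lineanchors');
lines = strsplit(newS, newline);
disp(lines{2})                           % second line after substitution
```

The second line becomes `t 4.0 -inf -inf -inf ` while the other lines are untouched, so the good values before the bad character survive and the rest of the line is padded.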
  10 Comments
Walter Roberson on 29 Mar 2019
Ah, I see it now, the 13.22.282. Unfortunately, textscan is happy to treat that as 13.22 0.282 without noticing anything wrong. So yes, a fair bit would have to be known about the correct representation of numbers on the system. For example, it helps to know for sure that it always puts a leading "0." on valid fractional values < 1: some systems would instead leave out the leading 0 and go directly to the period, '0.282' versus '.282'.
Are the numbers certain to have 3 decimal places? And is it certain that a positive number will always have a single space after the comma but a negative value will have no space after the comma?
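One hedged way to catch tokens like 13.22.282 before they reach textscan is to check each token against a strict pattern. The sketch below assumes a valid number has an optional sign, at least one leading digit, and at most one decimal point; the real rules (number of decimals, spacing) would need to come from the instrument, and the sample tokens are made up.

```matlab
% Made-up tokens: the third has two decimal points, the last lacks a leading 0
tokens  = {'13.220', '0.282', '13.22.282', '-4.500', '.282'};

% Assumed rule: optional sign, digits, then optionally one '.' followed by digits
pattern = '^[+-]?\d+(\.\d+)?$';

matched = regexp(tokens, pattern, 'match', 'once');  % empty where a token fails
isValid = ~cellfun(@isempty, matched);
disp(isValid)
```

Under this rule '13.22.282' and '.282' are flagged as invalid; if the files do write fractions without a leading 0, the pattern would have to be loosened accordingly.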
ndb on 4 Apr 2019
Edited: ndb on 4 Apr 2019
Just to close this out: I ended up incorporating readtable() (new to R2019a) into my workflow instead of textscan(). I decided to read in each file, play with it, and write out a file of filtered and processed data. While it's a bit slower, readtable() allowed me to deal with the data in a slightly cleaner way. I also ended up killing each line of data that had offending characters, an insufficient number of columns or variables, etc. While this exercise has allowed me to work on my regular expressions, I decided that I don't have time to deal with every little exception and error. For what it's worth, here's the bit of code that uses readtable() to read in my data:
fileList = dir('*.Neph.txt');
varNames = {'Year','Month','Day','Hour','Minute','Sec','Mtrash',...
'Dtrash','ytrash','Htrash','Mintrash','Sectrash','nm635','nm525',...
'nm450','back635nm','back525nm','back450nm','SampleTemp',...
'EnclosureTemp','RH','Pressure','MajorState','DIOState'};
varTypes = {'double','double','double','double','double','double',...
'double','double','double','double','double','double','double',...
'double','double','double','double','double','double','double',...
'double','double','char','char'};
delimiter = {' ','\t',',','/',':','-'};
dataStartLine = 1;
opts = delimitedTextImportOptions('VariableNames',varNames,...
'VariableTypes',varTypes,...
'Delimiter',delimiter,...
'DataLines', dataStartLine,...
'ConsecutiveDelimitersRule','join',...
'MissingRule','omitrow',...
'EmptyLineRule','skip',...
'ImportErrorRule','omitrow',...
'ExtraColumnsRule','ignore');
for i = 1:length(fileList)
if fileList(i).bytes ~= 0
%read in data
data = readtable(fileList(i).name,opts);
.
.
.
Thanks to Walter and dpb for the help and suggestions. Cheers


Release

R2019a
