Reading a very large text file of an almost regular data with empty value

Hello everybody,
I am trying to import an almost regular matrix into matlab. I used textscan with EmptyValue option to do it.
But it always give a error message 'badly formated string'. I do not understand why. Could you please give me a hand.
Below is the data file. The problem with the text file is:
first, there is empty in it. It would be better if I can get a NaN or 0 to replace the empty at the end
Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult.
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
The matrix I wanna obtain is
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00 NaN
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01 NaN
4.417E-03 1.000E+00 2.200E+05 11135 2.543452E+00 NaN
4.417E-03 1.000E+00 2.200E+05 11138 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11141 NaN NaN
4.417E-03 1.000E+00 2.200E+05 11144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+05 11351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2350 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2356 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2522 NaN NaN
4.417E-03 1.000E+00 2.200E+05 2711 NaN NaN
Ps. the file is very large. Data contains thousands of rows and columns. Should I do some optimisation for reading files? Thanks in advance to help me. Thank you very much.
[EDITED]
Format='%*10E %10E %9f %5d %E %E'
opt = {'EmptyValue',NaN,'CollectOutput',1};
tmp = textscan(fid,Format,'Delimiter','','Whitespace','',opt{:});

 Accepted Answer

Since your matrix is just almost regular, you will not be able to work with textscan easily. I guess the problem with EmptyValue is that there are no delimiters for the empty fields in your file.
I would suggest to use fgetl, and extract digits from the resulting char array with regexp. Then you get a cell array of char arrays, which you can convert to doubles.
For example:
clear
fid = fopen('test.txt','rt');
% maximum number of columns
maxlength = 6;
% preallocation
step = 2;
tm = nan(step,maxlength);
% reading line by line
k = 1;
while ~feof(fid)
thisline = fgetl(fid);
thisline = regexp(thisline,'\d*','match');
thisline = cellfun(@str2num,thisline);
tm(k,1:length(thisline)) = thisline;
r = size(tm,1);
% need to preallocate more?
if k == r
tm((r+1):(r+step), 1:maxlength) = nan;
end
k = k + 1;
end
fclose(fid);
For large files it makes sense to increase step.

8 Comments

The "while ~feof(fid)" method might fail, if the last line is empty. Therefore I always add a check, if the reply of FGETL is a CHAR.
REGEXP('\d*', 'match') splits the string into pure blocks of numbers. For the line '4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01' this is replied:
{'4', '417', '03', '1', '000', '00', '2', '200', '05', '462', '2', '543878', '00', '5', '440884', '01'}.
This is true, so for working with the data as stated above, one has to adjust the regexp expression.
With those, I guess my solution should work, too.
Hello, many thanks for this proposition. However, like Jan Simon said, if I don't know when the values are like '4.417E-03' or '5205' or etc... How could I adjust the regexp expression ?
If the columns would not join, i.e. there would be always at least a white space in between, you could use
regexp(thisline,'\s','split')
Giving it a second thought: it seems, that every column has a fixed number of digits. Maybe you can use this information...
OK, this line should work:
thisline = regexp(thisline,'(\d[.])?\d+([E](-|+)\d\d)?','match');
Hi thanks very much. It works, but I don't understand why your code works also with the joined-columns text file. Is it beacause +([E](-|+)\d\d) this expression to recongonize the E+00 ?
Otherwise, do you know how we can do the optimization for the time to read the text? Since the file has 500*10000 data to be read...
When I run the script in the profiler with a 10000*6-data-file, most time (~20 seconds) is spend in the cellfun line. Only 7 seconds are needed to read the file.
Hence, the conversion which we need due to the data format takes most time.

Sign in to comment.

More Answers (3)

The txt I created is (with a newline character at the end, otherwise won't read properly):
4.417E-03 1.000E+00 2.200E+05 462 2.543878E+00 5.440884E+01
4.417E-03 1.000E+00 2.200E+05 468 2.544193E+00 7.315421E+01
4.417E-03 1.000E+00 2.200E+05 687 2.255183E+00 5.011286E+01
4.417E-03 1.000E+00 2.200E+05 943 7.015397E+00
4.417E-03 1.000E+00 2.200E+05 947 1.877077E+01
4.417E-03 1.000E+00 2.200E+0511135 2.543452E+00
4.417E-03 1.000E+00 2.200E+0511138
4.417E-03 1.000E+00 2.200E+0511141
4.417E-03 1.000E+00 2.200E+0511144 2.543891E+00 4.701584E+01
4.417E-03 1.000E+00 2.200E+0511351 2.255163E+00 4.291446E+01
4.417E-03 1.000E+00 2.200E+05 1591 2.544160E+00 2.182716E+01
4.417E-03 1.000E+00 2.200E+05 1596 2.543892E+00 3.667904E+01
4.417E-03 1.000E+00 2.200E+05 1598
4.417E-03 1.000E+00 2.200E+05 2350
4.417E-03 1.000E+00 2.200E+05 2356
4.417E-03 1.000E+00 2.200E+05 2522
4.417E-03 1.000E+00 2.200E+05 2711
.
fid = fopen('test.txt');
% Import 3rd and 4th column as fixed char, the other directly as doubles
out = textscan(fid,'%f%f%9c%5c%f%f','EmptyValue',NaN);
% Line of blank to pad char fields
pad = repmat(' ',1,numel(out{1}));
% Read in the fixed width char fields as double
out{3} = cell2mat(textscan([pad; out{3}.'],'%f'));
out{4} = cell2mat(textscan([pad; out{4}.'],'%f'));
fclose(fid);
cell2mat(out)

4 Comments

Sorry, but out = textscan(...) outputs incorrect data starting from row 13.
If you copy paste the data as I posted it and add a newline at the end on R2011a Win32 Vista it imports correctly.
Works also on the data posted by gringoire except I have to add a newline at its end.
what about this, can you solve?? this data is 9 column
02:38:00 R- .026 065.457 01** 4862 0097 0074 +19
02:39:00 NaN .000 065.457 01** 4862 0101 0074 +19
02:40:00 NaN .000 065.457 01** 4862 0099 0074 +19
02:41:00 NaN .000 065.457 01** 4862 0097 0074 +19
02:42:00 R- .129 065.459 01** 4862 0111 0074 +19
02:43:00 R- .051 065.460 01** 4862 0099 0074 +19
note: NaN is empty or 2space(" ")
thanks b4

Sign in to comment.

Please post the command, which fails. It is impossible to guess exactly, what you have done.
This is a serious problem: "Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult." It is not trivial to split the string "2.200E+0511135". Without additional restirctions, it is even impossible. I assume, the exponent is limited to 2 digits - can you confirm this? Actually values > 1.0e100 are valid, but then it I'd consider the data file as damaged.
The file uses 4 significant digits per number in the leading columns. This is rather inprecise. I'd never draw any conclusions about the regularity of a large matrix with millions of elements, if the data are represented with such a low accuracy. Do you see any possibilities to obtain the values with SINGLE or better DOUBLE precision?
It is impossible to use TEXTSCAN due to the touching numbers. Therefore you have to insert a space after the 29th character at first and create a new file:
fidIN = fopen(FileName1, 'r');
if fidIN < 0, error('Cannot open file %s', FileName1); end
fidOUT = fopen(FileName2, 'w'); % EDITED: 'r'->'w'
if fidOUT < 0, error('Cannot open file %s', FileName2); end
while 1
S = fgetl(fidIN);
if ~ischar(S)
break;
end
% Split '4.417E-03 1.000E+00 2.200E+0511135'
S = [S(1:29), ' ', S(30:end)];
fwrite(fidOUT, S, 'uchar');
fwrite(fidOUT, 10, 'uchar'); % Or [13,10] for DOS line breaks
end
fclose(fidIN);
fclose(fidOUT);
Then TEXTSCAN has a chance to read the data. You need the 'MultipleDelimsAsOne' and the 'EmptyValue' flags.
PS. Please tell the programer who has created the file, that the format is trashy. It is near to be unusable. The strategy to format ASCII files are easy: Let the file be readable with the minimum number of rules. But you have repeated delimiters, missing data, touching columns and a large data set with low precision.

5 Comments

Thanks first for your comments.
I posted my command in the EDITED.
As you said, I think it is the case that the exponent is limited to 2 digits. All the files I saw is like this. The reason to the file using 4 significant digits at first 4 columns is that the value in the first 4 columns are not important, and they are just indications. that is why the following columns contain more significant digits.
As I am only a student, I don't know who is the programmer. Personnaly, I agreed with you for the format. I think the programer is not professional. He is a civil engineering professor or something like this. The file is due to a finis element analysis code. So what I am working is to realize a MatLab tool to read the results. But really, the format give me a lot troubles.
Thank you very much and it is very nice to meet you....
gringoire
The subject tells "almost regular data" (a synonym for "irregular data"...), but the text mentions "regular matrix". Perhaps my assumptions about the numerical property of regularity are not matching, because you meant the (ir)regularity or the ASCII format.
Yeah sorry for the poor english...
Otherwise, could you please tell me why I have always the error message:
??? Error using ==> textscan Badly formed format string.
Which part I made a mistake..
@gringoire: Talking in English about MATLAB is not easy.
Look at "help textscan" -> Supported FORMAT specifiers: http://www.mathworks.de/help/techdoc/ref/textscan.html
There is no "%E" specifier. You can define the width as in "%10E" in newer MATLAB versions only - e,g, not in 2009a.
Your format specifier belong to FSCANF - which is btw. also a good idea to read the file line by line.
Ah ok.. I thought %10E works for every function-

Sign in to comment.

Hi,
I think Jan's example code is good, but I suppose 3rd line should be:
fidOUT = fopen(FileName2, 'w'); % 'w' like write

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!