Reading a very large text file of an almost regular data with empty value

Question

0 votes

 Hello everybody,
 I am trying to import an almost regular matrix into matlab. I used textscan with EmptyValue option to do it. 
 But it always give a error message 'badly formated string'. I do not understand why. Could you please give me a hand.
 Below is the data file. The problem with the text file is:
 first, there is empty in it. It would be better if I can get a NaN or 0 to replace the empty at the end
 Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult. 
417E-03 1.000E+00 2.200E+05  462   2.543878E+00   5.440884E+01  
417E-03 1.000E+00 2.200E+05  468   2.544193E+00   7.315421E+01 
417E-03 1.000E+00 2.200E+05  687   2.255183E+00   5.011286E+01 
417E-03 1.000E+00 2.200E+05  943   7.015397E+00 
417E-03 1.000E+00 2.200E+05  947   1.877077E+01
417E-03 1.000E+00 2.200E+0511135   2.543452E+00
417E-03 1.000E+00 2.200E+0511138
417E-03 1.000E+00 2.200E+0511141
417E-03 1.000E+00 2.200E+0511144   2.543891E+00   4.701584E+01 
417E-03 1.000E+00 2.200E+0511351   2.255163E+00   4.291446E+01 
417E-03 1.000E+00 2.200E+05 1591   2.544160E+00   2.182716E+01 
417E-03 1.000E+00 2.200E+05 1596   2.543892E+00   3.667904E+01 
417E-03 1.000E+00 2.200E+05 1598 
417E-03 1.000E+00 2.200E+05 2350 
417E-03 1.000E+00 2.200E+05 2356 
417E-03 1.000E+00 2.200E+05 2522 
417E-03 1.000E+00 2.200E+05 2711 
 The matrix I wanna obtain is
417E-03 1.000E+00 2.200E+05   462   2.543878E+00   5.440884E+01  
417E-03 1.000E+00 2.200E+05   468   2.544193E+00   7.315421E+01 
417E-03 1.000E+00 2.200E+05   687   2.255183E+00   5.011286E+01 
417E-03 1.000E+00 2.200E+05   943   7.015397E+00   NaN   
417E-03 1.000E+00 2.200E+05   947   1.877077E+01   NaN
417E-03 1.000E+00 2.200E+05 11135   2.543452E+00   NaN
417E-03 1.000E+00 2.200E+05 11138   NaN            NaN  
417E-03 1.000E+00 2.200E+05 11141   NaN            NaN
417E-03 1.000E+00 2.200E+05 11144   2.543891E+00   4.701584E+01 
417E-03 1.000E+00 2.200E+05 11351   2.255163E+00   4.291446E+01 
417E-03 1.000E+00 2.200E+05  1591   2.544160E+00   2.182716E+01 
417E-03 1.000E+00 2.200E+05  1596   2.543892E+00   3.667904E+01 
417E-03 1.000E+00 2.200E+05  1598   NaN            NaN 
417E-03 1.000E+00 2.200E+05  2350   NaN            NaN 
417E-03 1.000E+00 2.200E+05  2356   NaN            NaN 
417E-03 1.000E+00 2.200E+05  2522   NaN            NaN 
417E-03 1.000E+00 2.200E+05  2711   NaN            NaN

Ps. the file is very large. Data contains thousands of rows and columns. Should I do some optimisation for reading files? Thanks in advance to help me. Thank you very much.

[EDITED]

Format='%*10E %10E %9f %5d %E %E'
opt  = {'EmptyValue',NaN,'CollectOutput',1};
tmp = textscan(fid,Format,'Delimiter','','Whitespace','',opt{:});

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Jacob Palczynski on 25 Aug 2011

Open in MATLAB Online

2 votes

Since your matrix is just almost regular, you will not be able to work with textscan easily. I guess the problem with EmptyValue is that there are no delimiters for the empty fields in your file.

I would suggest to use fgetl, and extract digits from the resulting char array with regexp. Then you get a cell array of char arrays, which you can convert to doubles.

For example:

clear
fid = fopen('test.txt','rt');
% maximum number of columns
maxlength = 6;
% preallocation
step = 2;
tm = nan(step,maxlength);
% reading line by line
k = 1;
while ~feof(fid)
    thisline = fgetl(fid);
    thisline = regexp(thisline,'\d*','match');
    thisline = cellfun(@str2num,thisline);
    tm(k,1:length(thisline)) = thisline;
    r = size(tm,1);
    % need to preallocate more?
    if k == r
        tm((r+1):(r+step), 1:maxlength) = nan;
    end
    k = k + 1;
end
fclose(fid);

For large files it makes sense to increase step.

8 Comments
Show 6 older comments Hide 6 older comments

gringoire on 26 Aug 2011

Otherwise, do you know how we can do the optimization for the time to read the text? Since the file has 500*10000 data to be read...

Jacob Palczynski on 26 Aug 2011

When I run the script in the profiler with a 10000*6-data-file, most time (~20 seconds) is spend in the cellfun line. Only 7 seconds are needed to read the file.

Hence, the conversion which we need due to the data format takes most time.

Sign in to comment.

Answer 2

Oleg Komarov on 25 Aug 2011

Open in MATLAB Online

2 votes

The txt I created is (with a newline character at the end, otherwise won't read properly):

417E-03 1.000E+00 2.200E+05  462   2.543878E+00   5.440884E+01
417E-03 1.000E+00 2.200E+05  468   2.544193E+00   7.315421E+01
417E-03 1.000E+00 2.200E+05  687   2.255183E+00   5.011286E+01
417E-03 1.000E+00 2.200E+05  943   7.015397E+00
417E-03 1.000E+00 2.200E+05  947   1.877077E+01
417E-03 1.000E+00 2.200E+0511135   2.543452E+00
417E-03 1.000E+00 2.200E+0511138
417E-03 1.000E+00 2.200E+0511141
417E-03 1.000E+00 2.200E+0511144   2.543891E+00   4.701584E+01
417E-03 1.000E+00 2.200E+0511351   2.255163E+00   4.291446E+01
417E-03 1.000E+00 2.200E+05 1591   2.544160E+00   2.182716E+01
417E-03 1.000E+00 2.200E+05 1596   2.543892E+00   3.667904E+01
417E-03 1.000E+00 2.200E+05 1598
417E-03 1.000E+00 2.200E+05 2350
417E-03 1.000E+00 2.200E+05 2356
417E-03 1.000E+00 2.200E+05 2522
417E-03 1.000E+00 2.200E+05 2711

.

fid = fopen('test.txt');

% Import 3rd and 4th column as fixed char, the other directly as doubles

out = textscan(fid,'%f%f%9c%5c%f%f','EmptyValue',NaN);

% Line of blank to pad char fields

pad = repmat(' ',1,numel(out{1}));

% Read in the fixed width char fields as double

out{3} = cell2mat(textscan([pad; out{3}.'],'%f'));
out{4} = cell2mat(textscan([pad; out{4}.'],'%f'));
fclose(fid);
cell2mat(out)

4 Comments
Show 2 older comments Hide 2 older comments

Oleg Komarov on 26 Aug 2011

Works also on the data posted by gringoire except I have to add a newline at its end.

Soni huu on 29 Jun 2012

Edited: Soni huu on 29 Jun 2012

what about this, can you solve?? this data is 9 column

02:38:00 R- .026 065.457 01** 4862 0097 0074 +19

02:39:00 NaN .000 065.457 01** 4862 0101 0074 +19

02:40:00 NaN .000 065.457 01** 4862 0099 0074 +19

02:41:00 NaN .000 065.457 01** 4862 0097 0074 +19

02:42:00 R- .129 065.459 01** 4862 0111 0074 +19

02:43:00 R- .051 065.460 01** 4862 0099 0074 +19

note: NaN is empty or 2space(" ")

thanks b4

Sign in to comment.

Answer 3

Jan on 25 Aug 2011

Open in MATLAB Online

1 vote

Please post the command, which fails. It is impossible to guess exactly, what you have done.

This is a serious problem: "Second, between the column 3 and 4, sometimes, the values are attached. It also makes the input difficult." It is not trivial to split the string "2.200E+0511135". Without additional restirctions, it is even impossible. I assume, the exponent is limited to 2 digits - can you confirm this? Actually values > 1.0e100 are valid, but then it I'd consider the data file as damaged.

The file uses 4 significant digits per number in the leading columns. This is rather inprecise. I'd never draw any conclusions about the regularity of a large matrix with millions of elements, if the data are represented with such a low accuracy. Do you see any possibilities to obtain the values with SINGLE or better DOUBLE precision?

It is impossible to use TEXTSCAN due to the touching numbers. Therefore you have to insert a space after the 29th character at first and create a new file:

fidIN  = fopen(FileName1, 'r');
if fidIN < 0, error('Cannot open file %s', FileName1); end
fidOUT = fopen(FileName2, 'w');  % EDITED: 'r'->'w'
if fidOUT < 0, error('Cannot open file %s', FileName2); end
while 1
  S = fgetl(fidIN);
  if ~ischar(S)
    break;
  end
  % Split '4.417E-03 1.000E+00 2.200E+0511135'
  S = [S(1:29), ' ', S(30:end)];
  fwrite(fidOUT, S,  'uchar');
  fwrite(fidOUT, 10, 'uchar');  % Or [13,10] for DOS line breaks
end
fclose(fidIN);
fclose(fidOUT);

Then TEXTSCAN has a chance to read the data. You need the 'MultipleDelimsAsOne' and the 'EmptyValue' flags.

PS. Please tell the programer who has created the file, that the format is trashy. It is near to be unusable. The strategy to format ASCII files are easy: Let the file be readable with the minimum number of rules. But you have repeated delimiters, missing data, touching columns and a large data set with low precision.

5 Comments
Show 3 older comments Hide 3 older comments

Jan on 25 Aug 2011

@gringoire: Talking in English about MATLAB is not easy.

Look at "help textscan" -> Supported FORMAT specifiers: http://www.mathworks.de/help/techdoc/ref/textscan.html

There is no "%E" specifier. You can define the width as in "%10E" in newer MATLAB versions only - e,g, not in 2009a.

Your format specifier belong to FSCANF - which is btw. also a good idea to read the file line by line.

gringoire on 26 Aug 2011

Ah ok.. I thought %10E works for every function-

Sign in to comment.

Answer 4

Antti on 25 Aug 2011

Open in MATLAB Online

1 vote

Hi,

I think Jan's example code is good, but I suppose 3rd line should be:

fidOUT = fopen(FileName2, 'w'); % 'w' like write

1 Comment
Show -1 older comments Hide -1 older comments

Jan on 25 Aug 2011

Yes. Fixed now.

Sign in to comment.

Reading a very large text file of an almost regular data with empty value

0 Comments
Show -2 older comments Hide -2 older comments

Accepted Answer

8 Comments
Show 6 older comments Hide 6 older comments

More Answers (3)

4 Comments
Show 2 older comments Hide 2 older comments

5 Comments
Show 3 older comments Hide 3 older comments

1 Comment
Show -1 older comments Hide -1 older comments

Categories

Tags

Community Treasure Hunt

Reading a very large text file of an almost regular data with empty value

0 Comments Show -2 older comments Hide -2 older comments

Accepted Answer

8 Comments Show 6 older comments Hide 6 older comments

More Answers (3)

4 Comments Show 2 older comments Hide 2 older comments

5 Comments Show 3 older comments Hide 3 older comments

1 Comment Show -1 older comments Hide -1 older comments

Categories

Tags

See Also

Community Treasure Hunt

0 Comments
Show -2 older comments Hide -2 older comments

8 Comments
Show 6 older comments Hide 6 older comments

4 Comments
Show 2 older comments Hide 2 older comments

5 Comments
Show 3 older comments Hide 3 older comments

1 Comment
Show -1 older comments Hide -1 older comments