Why does csvread behave differently for large csv files?

Question


I have two csv files that I'm trying to read in. The first contains one row of integers, the second contains one row of floats.

They are both formatted in the same way (with a trailing comma):

int_val_1,int_val_2,...,int_val_n,
float_val_1,float_val_2,...,float_val_m,

As I understand it, csvread should produce a row matrix with an extra 0 at the end (due to the trailing comma). In my case, however, csvread produces a column matrix without an extra 0 for the first file, and a row matrix with an extra 0 for the second file. This only happens if the first file is large (e.g., 589824 integers). If there are a small number of integers, it behaves as expected.

What's going on?

1 Comment

Jeremy Hughes on 8 Jun 2015

Edited: Jeremy Hughes on 8 Jun 2015


Hi Peter,

You have run into an unfortunate limitation in the way csvread detects the number of columns in the file. Since your file is one long row, csvread assumes it's all one never-ending stream of data. (At 100,000 columns, as Per discovered below, it stops counting and just returns a column.)

If you want to get consistent results on the output shape, you can call textscan in the following way.

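fid = fopen(filename);   % open the CSV file for reading
% Read every comma-separated number into a single column, whatever the row length
data = textscan(fid,'%f','Delimiter',',','EndOfLine','\r\n');
fclose(fid);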

The variable "data" will be a cell array containing a column of numbers. If you need a row, just pull it out of the cell array and transpose:

data = (data{1})';

I hope this helps,

Jeremy
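
A minimal sketch of the textscan approach from the comment above, run against a throwaway file. The file name `row_test.csv` and the value count `N` are made up for illustration. How the trailing comma is treated may differ between releases, so the snippet relies only on the result being a row that starts with the `N` written values.

```matlab
% Hypothetical test file: one long row of N values with a trailing comma
N = 2000;
filename = 'row_test.csv';
fid = fopen(filename, 'w');
fprintf(fid, '%s', repmat('1.5,', 1, N));
fclose(fid);

% Read everything as a single column of doubles
fid = fopen(filename, 'r');
if fid == -1
    error('Could not open %s', filename);
end
cac = textscan(fid, '%f', 'Delimiter', ',');
fclose(fid);

% Pull the column out of the cell array and transpose it to a row
row = cac{1}.';
disp(size(row));

delete(filename);   % clean up the throwaway file
```

Because the format is given explicitly as `'%f'`, textscan does not have to guess the number of columns, so the shape should not depend on how long the row is.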


Answer 1

per isakson on 5 Jun 2015

Edited: per isakson on 5 Jun 2015



I reproduced your result on R2013a, Win7:

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
>> [CR,FS] = cssm(1e3); whos('CR','FS')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double

where

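function    [ CR, FS ] = cssm( N )
    % Write N copies of '1.1,' on a single line (note the trailing comma)
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    fprintf( fid, '%s', str );
    fclose( fid );
    % Read the file back with csvread ...
    CR  = csvread( 'cssm.txt' );
    % ... and with fscanf for comparison
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
end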

"As I understand it, csvread should produce a row matrix with an extra 0": I didn't find that stated in the documentation of csvread.

csvread is based on textscan and contains a bit of automagic. I guess it was never intended for rows that long, i.e. files without newlines.


without the ending comma

And without the ending comma, csvread returns a row for the large file.

>> [CR,FS] = cssm(1e5); whos('CR','FS')
  Name           Size                 Bytes  Class     Attributes
  CR             1x100000            800000  double              
  FS        100000x1                 800000  double


textscan with empty formatSpec

csvread calls textscan with formatSpec set to an empty string. That option of textscan is not documented. It makes a difference in this special case.

>> [CR,FS,TS1,TS2] = cssm(1e3); whos('CR','FS','TS1','TS2')
  Name         Size              Bytes  Class     Attributes
  CR           1x1001             8008  double              
  FS        1000x1                8000  double              
  TS1       1000x1                8000  double              
  TS2          1x1001             8008  double              
>> [CR,FS,TS1,TS2] = cssm(1e5); whos('CR','FS','TS1','TS2')
  Name           Size             Bytes  Class     Attributes
  CR        100000x1             800000  double              
  FS        100000x1             800000  double              
  TS1       100000x1             800000  double              
  TS2       100000x1             800000  double

where

function    [ CR, FS, TS1, TS2 ] = cssm( N )
    % Write N copies of '1.1,' on a single line (note the trailing comma)
    str = repmat( '1.1,', 1, N );
    fid = fopen( 'cssm.txt', 'w' );
    fprintf( fid, '%s', str(1:end) );
    fclose( fid );
    % csvread
    CR  = csvread( 'cssm.txt' );
    % fscanf
    fid = fopen( 'cssm.txt', 'r' );
    FS  = fscanf( fid, '%f,' );
    fclose( fid );
    % textscan with an explicit '%f' formatSpec
    fid = fopen( 'cssm.txt', 'r' );
    cac = textscan( fid, '%f', 'Delimiter',','            ... 
                  , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS1 = cac{:};
    % textscan with an empty formatSpec, as csvread calls it
    fid = fopen( 'cssm.txt', 'r' );
    cac  = textscan( fid, '', 'Delimiter',','               ...
                   , 'CollectOutput',true, 'EmptyValue',999 );
    fclose( fid );
    TS2 = cac{:};
end


For the large file, all of the functions and options I tested fail to recognize the ending comma.
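
If the extra trailing zero matters, one possible workaround is to detect the ending comma yourself instead of relying on the reader. The following is only a sketch: the file name `trailing_test.csv` and the variable names are made up for illustration, and it uses fscanf because its output shape was the same for both file sizes in the tests above.

```matlab
% Hypothetical file name, used only for this illustration
filename = 'trailing_test.csv';
fid = fopen(filename, 'w');
fprintf(fid, '%s', '1,2,3,');
fclose(fid);

% Look at the last non-whitespace character of the raw text
txt = strtrim(fileread(filename));
hasTrailingComma = ~isempty(txt) && txt(end) == ',';

% Parse with fscanf (returns a column) and transpose to a row
fid = fopen(filename, 'r');
vals = fscanf(fid, '%f,').';
fclose(fid);

% Append the zero that the csvread help text describes, if it is wanted
if hasTrailingComma
    vals(end+1) = 0;
end
disp(vals);

delete(filename);   % clean up the throwaway file
```

This reads the file twice, which is simple but not free for very large files. Checking only the last few bytes with fseek and fread would avoid the second pass.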

2 Comments

Peter on 5 Jun 2015


Interesting. While it's not on the documentation page,

help csvread

produces

csvread fills empty delimited fields with zero.  Data files where
    the lines end with a comma will produce a result with an extra last 
    column filled with zeros.
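
To see the behavior described in that help text on a small file, something like the following should do. It is a sketch with a made-up file name (`small_test.csv`), and it assumes a release in which csvread is still available.

```matlab
% Hypothetical two-line file in which every line ends with a comma
filename = 'small_test.csv';
fid = fopen(filename, 'w');
fprintf(fid, '1,2,3,\n4,5,6,\n');
fclose(fid);

% Per the help text, the empty field after each trailing comma
% should show up as an extra last column of zeros
M = csvread(filename);
disp(M);

delete(filename);   % clean up the throwaway file
```

With short, multi-line files like this one, the column count is easy for csvread to detect. The single 100,000-value line discussed above is the case where the detection breaks down.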

per isakson on 5 Jun 2015

Edited: per isakson on 5 Jun 2015

The test suites at The MathWorks don't always cover all the edge cases, I guess.
