Loading Large CSV files

136 views (last 30 days)
Alan Kong
Alan Kong on 31 Jul 2015
Edited: per isakson on 1 Aug 2015
Hi Everyone,
I have a CSV file of size 6GB and I tried using the import function in MATLAB to load it, but it failed due to a memory issue. Is there a way to reduce the size of the file?
I think the number of columns is causing the problem. The file is 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the file gets really big.
Ultimately, if I could read the data column-wise so that I get 2329 column vectors of 133076 elements each, that would be great.
I am using MATLAB R2014a.
  2 Comments
Walter Roberson
Walter Roberson on 31 Jul 2015
Are the fields all numeric?
Alan Kong
Alan Kong on 31 Jul 2015
Yes, only the first row contains string IDs like 'R1' to 'R2329'.


Answers (2)

Cedric
Cedric on 31 Jul 2015
Edited: Cedric on 31 Jul 2015
If the file contains numbers that you ultimately want in a numeric array of doubles in MATLAB, the array alone will be around 2.5GB (133076 rows × 2329 columns × 8 bytes). The problem probably comes from the fact that loading the whole file as text, plus processing it, plus allocating this array requires more memory than your machine can provide.
You can always process the file line by line, or in chunks smaller than the whole file, and populate a 2.5GB numeric array. Something along the following lines:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
    rowCnt = rowCnt + 1 ;
    % - Realloc + a chunk if rowCnt larger than data array.
    if rowCnt > size( data, 1 )
        fprintf( 'Realloc ..\n' ) ;
        data(size(data,1)+chunk_nRows, nCols) = 0 ;
    end
    % - Read line, convert and store.
    line = fgetl( fId ) ;
    data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
And we can imagine plenty of other ways to read the file by blocks, e.g. 500MB at a time.
Well, here is another way, which is likely to be more efficient:
blockSize = 500e6 ;  % Choose large enough so there are not too many blocks.
tailSize  = 100 ;    % Choose larger than the text representation of one value.
% - Open file.
fId = fopen( 'largeFile.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
data = sscanf( line, '%f,' ) ;
nCols = numel( data ) ;
lastBit = '' ;
while ~feof( fId )
    % - Read and pre-process block.
    buffer = fread( fId, blockSize, '*char' ) ;
    isLast = length( buffer ) < blockSize ;
    buffer(buffer==10) = ',' ;
    buffer(buffer==13) = '' ;
    % - Prepend last bit of previous block.
    if ~isempty( lastBit )
        buffer = [lastBit; buffer] ; %#ok<AGROW>
    end
    % - Truncate to last ',' and keep last bit for next iteration.
    if ~isLast
        n = find( buffer(end-tailSize:end)==',', 1, 'last' ) ;
        cutAt = length(buffer) - tailSize + n ;
        lastBit = buffer(cutAt:end) ;
        buffer(cutAt:end) = [] ;
    end
    % - Parse.
    data = [data; sscanf( buffer, '%f,' )] ; %#ok<AGROW>
end
% - Close file.
fclose( fId ) ;
% - Reshape data vector -> array.
data = reshape( data, nCols, [] )' ;
  6 Comments
Walter Roberson
Walter Roberson on 31 Jul 2015
Reading the full file doesn't require that you store the contents:
function numlines = count_remaining_lines(fid)
    numlines = 0;
    while true
        if ~ischar(fgets(fid)); break; end  % end of file
        numlines = numlines + 1;
    end
end
So you would do something like:
fid = fopen('largeFile.csv', 'rt');
headerline = fgetl(fid);
headerfields = regexp(headerline, ',', 'split');
numcol = length(headerfields);
numrow = count_remaining_lines(fid);
frewind(fid);
fgetl(fid); %re-read and discard the headerline
data = zeros(numrow, numcol);
Now you can loop reading a row at a time, storing it into data(RowNum, :) without having to expand the data buffer.
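For instance, the fill loop could look like this (a sketch only; it reuses the `sscanf` pattern from Cedric's answer and assumes `fid`, `numrow` and `data` from the code above, with `fid` positioned just after the header line):

```matlab
% Sketch: fill 'data' one CSV row at a time, no reallocation needed.
for rowNum = 1 : numrow
    oneLine = fgetl( fid ) ;                      % read one row as text
    data(rowNum,:) = sscanf( oneLine, '%f,' )' ;  % parse comma-separated doubles
end
fclose( fid ) ;
```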
Cedric
Cedric on 1 Aug 2015
Edited: Cedric on 1 Aug 2015
Alan, if the data is not too confidential, could you run the following code on your file (replace 'largeFile.csv' with your file name) and post a comment with the dump file dump.csv attached?
fId = fopen( 'largeFile.csv', 'r' ) ;
data = fread( fId, 1e6, '*char' ) ;
fclose( fId ) ;
fId = fopen( 'dump.csv', 'w' ) ;
fwrite( fId, data ) ;
fclose( fId ) ;
That will extract a small chunk of your file that we will be able to use (after truncation) to perform tests, without your having to open a 6GB file just to take a slice.



Amr Hashem
Amr Hashem on 31 Jul 2015
You can split the file with any free file-splitting program.
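If installing a splitter is not an option, MATLAB itself can cut the file into pieces (a sketch only; 'largeFile.csv', the part-file names, and the 20000-line chunk size are placeholders to adjust):

```matlab
% Hypothetical sketch: split largeFile.csv into numbered 20000-line parts.
linesPerPart = 2e4 ;
fIn  = fopen( 'largeFile.csv', 'r' ) ;
part = 0 ;
while ~feof( fIn )
    part = part + 1 ;
    fOut = fopen( sprintf( 'part_%03d.csv', part ), 'w' ) ;
    for k = 1 : linesPerPart
        oneLine = fgets( fIn ) ;             % keeps the newline characters
        if ~ischar( oneLine ), break ; end   % end of file
        fwrite( fOut, oneLine ) ;
    end
    fclose( fOut ) ;
end
fclose( fIn ) ;
```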
  2 Comments
Walter Roberson
Walter Roberson on 1 Aug 2015
Angry Birds refuses to split my files.
per isakson
per isakson on 1 Aug 2015
Edited: per isakson on 1 Aug 2015

E.g. GSplit

