MATLAB Answers

Importing a Large CSV in chunks and getting Data Range Invalid

13 views (last 30 days)
John Gagnon
John Gagnon on 25 Apr 2019
Answered: Jeremy Hughes on 25 Apr 2019
I have a large CSV File with 17+million rows of data. My current code is importing 100,000 lines at a time and processing that data down than overwritting that raw data for the next 100k in an attempt to not run out of memory.
for i=0:100000:10000000
RangeString = sprintf('B%d:O%d', start_row, end_row);
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
%%Process Data%%
The End Result anytime I try import a line over 1m.
Data range 'B1000002:O1100000' is invalid.
Error in DataAnalysis (line 13)
[num,text,raw] = xlsread('ABC.csv', 1, RangeString);
What is the better way to import and process this data?
Walter Roberson
Walter Roberson on 25 Apr 2019
For textscan() you specify a format in a fscanf() like way. You need one format specifier for each input field, and if you do not want the field returned you can use the * modifier:
=> read a number to be returned, read a string but discard it, read a number to be returned
Normally, textscan returns a cell array in which there is one element for each format specifier (that is not marked to discard.) In most uses, that means a cell array that contains one cell per column of input (that you want to keep.) The CollectOutput option triggers merging the output for adjacent elements that use the same datatype.
Dates can be read in as dates with a %D specifier. However, it is tricky to get it to scan a date with a space in it as a single field; it is often easier to scan a date with a space in it as a date field and then a duration field, and add the duration to the date afterwards.

Sign in to comment.

Answers (2)

Bob Thompson
Bob Thompson on 25 Apr 2019
Edited: Bob Thompson on 25 Apr 2019
You can call headerlines to specify a starting point with textscan, but I don't know if it is possible to specify an ending point, which is part of why I suggested dlmread.
Because you have dates, another option you can try is to read each line as a string of text, and then split the values with regexp. That might look something like this.
fid = fopen('mydata.csv');
line = fgetl(fid);
count = 1;
while ~isnumeric(line)
if rem(count,100000)==0; % Look for
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens'); % Sample only, should return four cells with data
^ ^ ^ ^
Interger Integer DateTime Integer
data(count,:) = [tmp{:}]; % May need to go one level deep {1}{:}, regexp is weird
% Put what you want to do with each block of 100k data lines here. Not really sure what you want to do with it
count = 0;
tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens');
data(count,:) = [tmp{:}];
line = fgetl(fid);
count = count + 1;
This is not a perfect code, and may require you to learn some about regexp to get the formatting correct. I was also unsure what exactly your format looks like.
Alternatively, you could try reading with dlmread, and then concatenating several columns into one for a date time, either using datetime, or just by concatenating.
Keep in mind that no matter which method you use that you will have a set of data with mixed classes, so you will either need to work with cells, a table, or a structure.
Walter Roberson
Walter Roberson on 25 Apr 2019
For textscan() you specify a repeat count immediately after the format string.

Sign in to comment.

Jeremy Hughes
Jeremy Hughes on 25 Apr 2019
I'd reccomend using tabularTextDatastore for this case. It will try to automatically detect formats and handle the reading of sections of the file for you.




Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!