Importing a Large CSV in chunks and getting Data Range Invalid

Question

John Gagnon on 25 Apr 2019

0
Link

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/458507-importing-a-large-csv-in-chunks-and-getting-data-range-invalid

Answered: Jeremy Hughes on 25 Apr 2019

I have a large CSV File with 17+million rows of data. My current code is importing 100,000 lines at a time and processing that data down than overwritting that raw data for the next 100k in an attempt to not run out of memory.

for i=0:100000:10000000
    start_row=2+i;
    end_row=100000+i; 
    RangeString = sprintf('B%d:O%d', start_row, end_row); 
    [num,text,raw] = xlsread('ABC.csv', 1, RangeString);
 %%Process Data%%
End

The End Result anytime I try import a line over 1m.

Data range 'B1000002:O1100000' is invalid.

Error in DataAnalysis (line 13)

[num,text,raw] = xlsread('ABC.csv', 1, RangeString);

What is the better way to import and process this data?

6 Comments
Show 4 older commentsHide 4 older comments

Walter Roberson on 25 Apr 2019

You might want to look at https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.tabulartextdatastore.html -- though it appears to use textscan() in part.

Walter Roberson on 25 Apr 2019

For textscan() you specify a format in a fscanf() like way. You need one format specifier for each input field, and if you do not want the field returned you can use the * modifier:

%f%*s%f

=> read a number to be returned, read a string but discard it, read a number to be returned

Normally, textscan returns a cell array in which there is one element for each format specifier (that is not marked to discard.) In most uses, that means a cell array that contains one cell per column of input (that you want to keep.) The CollectOutput option triggers merging the output for adjacent elements that use the same datatype.

Dates can be read in as dates with a %D specifier. However, it is tricky to get it to scan a date with a space in it as a single field; it is often easier to scan a date with a space in it as a date field and then a duration field, and add the duration to the date afterwards.

Sign in to comment.

Sign in to answer this question.

Answer 1

Bob Thompson on 25 Apr 2019

0
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/458507-importing-a-large-csv-in-chunks-and-getting-data-range-invalid#answer_372273

Edited: Bob Thompson on 25 Apr 2019

You can call headerlines to specify a starting point with textscan, but I don't know if it is possible to specify an ending point, which is part of why I suggested dlmread.

Because you have dates, another option you can try is to read each line as a string of text, and then split the values with regexp. That might look something like this.

fid = fopen('mydata.csv');
line = fgetl(fid);
count = 1;
while ~isnumeric(line)
    if rem(count,100000)==0; % Look for 
        tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens'); % Sample only, should return four cells with data
                            ^        ^            ^            ^
                          Interger Integer    DateTime      Integer
        data(count,:)  = [tmp{:}]; % May need to go one level deep {1}{:}, regexp is weird
        % Put what you want to do with each block of 100k data lines here. Not really sure what you want to do with it
        count = 0;
    else
        tmp = regexp(line,'(\d+)\s+(\d+)\s+(\d+\s\d+\s\d+)\s+(\d+)','tokens');
        data(count,:)  = [tmp{:}];
    end
    line = fgetl(fid);
    count = count + 1;
end

This is not a perfect code, and may require you to learn some about regexp to get the formatting correct. I was also unsure what exactly your format looks like.

Alternatively, you could try reading with dlmread, and then concatenating several columns into one for a date time, either using datetime, or just by concatenating.

Keep in mind that no matter which method you use that you will have a set of data with mixed classes, so you will either need to work with cells, a table, or a structure.

2 Comments
Show NoneHide None

John Gagnon on 25 Apr 2019

Thank you. I will take a look at regexp and see if I can get this to work!

Walter Roberson on 25 Apr 2019

For textscan() you specify a repeat count immediately after the format string.

Sign in to comment.

Answer 2

Jeremy Hughes on 25 Apr 2019

0
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/458507-importing-a-large-csv-in-chunks-and-getting-data-range-invalid#answer_372298

I'd reccomend using tabularTextDatastore for this case. It will try to automatically detect formats and handle the reading of sections of the file for you.

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Importing a Large CSV in chunks and getting Data Range Invalid

6 Comments
Show 4 older commentsHide 4 older comments

Answers (2)

2 Comments
Show NoneHide None

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Importing a Large CSV in chunks and getting Data Range Invalid

6 Comments Show 4 older commentsHide 4 older comments

Answers (2)

2 Comments Show NoneHide None

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

6 Comments
Show 4 older commentsHide 4 older comments

2 Comments
Show NoneHide None

0 Comments
Show -2 older commentsHide -2 older comments