How to split huge .csv files to multiple .csv files based on size?

I am trying to split huge .csv files (11 GB) that has both combination of text and numbers into mutiple files based on size (0.5 GB each). I tried using some of the answers in the matlab community but no luck
I hope someone can help!

4 Comments

Shoulda' never created that large and unwieldy a text file to begin with... But you now know that, already... :)
Can you guesstimate how many lines are in the files at present? I'd probably just write a filter and copy N lines to a new file in a loop to determine that answer, then just put that in a nested loop until you run out of lines in the original. Lather, rinse, repeat...
I am just processing the data how I got. I did not have a choice to choose file type.
But coming to your question, i used
Nrows = numel(textread(filename,'%1c%*[^\n]'));
and found there is 43,633,337 lines in it. Can you please help me with how to copy N lines to a new file? (I know how to do when ther is just .csv file with numbers but not when there is a combination of text & numbers). Thanks!
Which OS? The easiest way to do this is with the unix split command , quite easy . For Windows I would look at https://www.gdgsoft.com/gsplit/
Thanks Walter, It does work but I had to split many files. The code that user "dbp" recommended in the below comment works great if you want to split files in Matlab.

Sign in to comment.

 Accepted Answer

fidR=fopen('originalfile.csv','r'); % open the big file to read
NFiles=20; % number of files to create
NPerFile=round(Nrows/NFiles)+1; % rough number records / file
for i=1:NFiles
try
fidW=fopen(num2str(i,'fileNew%00d.csv'),'w'); % open a file to write
for j=1:NPerFile
fwrite(fidW,fgets(fidR)); % transcribe lines verbatim
end
fidW=fclose(fidW); % close that one
catch
fidW=fclose(fidW); % close that one
fidR=fclose(fidR);
end
is a poor-man's split tossed off at the console here; the rounding for number of records per file should catch the whole file; the error routine should only occur when the last file runs into feof on the last one if my logic is right.
And, yes, unfortunately, we can't always make others do sensible things about how they collect data we're subsequently given, granted...I recognized that was likely the case, hence the smiley.

4 Comments

Thank you very much!
I need to just modify two simple things from your code to work correctly. The line ' % open a file to write' and 'Nrows'. If I use "num2str" instead of instead of sprintf it was not creating different filenames. So my final code was:
clc, clear all
fidR=fopen('originalfile.csv','r'); % open the big file to read
NFiles=20; % number of files to create
Nrows = numel(textread('originalfile.csv','%1c%*[^\n]'));
NPerFile=round(Nrows/NFiles)+1; % rough number records / file
for i=1:NFiles
try
fidW=fopen(sprintf('fileNew%d.csv',i) ,'w'); % open a file to write
for j=1:NPerFile
fwrite(fidW,fgets(fidR)); % transcribe lines verbatim
end
fidW=fclose(fidW); % close that one
catch
fidW=fclose(fidW); % close that one
fidR=fclose(fidR);
end
end
Ah, yes, "air code" is fraught with typos...
fidW=fopen(num2str(i,'fileNew%02d.csv'),'w');
is correct format string, not '%00d' and you had already defined NRows above... :)
NB: Using
sprintf('fileNew%d.csv',i) % See footnote
will end up with files numbered ...w1,...w2,......w9,...w10,...w11,...w12,...,...w19,...w20,...,
which will come back in alphanumeric order with dir, not in natural order so you'll get ...w1,...w10,...w11,...w12,...,...w19,...w2,...w20,......w9,...,...
probably not the desired result. The '%02d' adds the leading zero so alphanumeric and natural sort order are the same.
(*) To be clear, it's not the difference between sprintf and num2str, it's the format string. I tend to use num2str because it has been vectorized to create an array of strings whereas sprintf runs all the vector output strings together as one long string so that takes additional machinations to separate out when used in vector expressions. So, for self-consistency, I just tend to use the one instead of swapping back and forth based on use context. Just a style preference, basically.
How will this script work on a .csv file with header? Would it be better to remove the header from the big file, do the split, and then add headers on the top line of each of those split files? How can I concatanate headers as the first line of an entire directory of .csv files? Thank
compose() or the undocumented sprintfc() are good at creating separate outputs.

Sign in to comment.

More Answers (0)

Categories

Asked:

on 20 Nov 2018

Edited:

on 13 Feb 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!