
Faster ways to deal with bigger data (1 to 10 TB ish)

There are some thousands of large .csv files (each 8 GB max.) that I absolutely have to read top to bottom to do basic operations on (they're on my hard drive; see the attachment to get an idea of what's in them). I want to convert them to .mat files after reading them with readtable(), but reading them takes days - I need them fast. Could you help optimize my plan for converting them to a more manageable format via MATLAB in a short time, on my ~30 USD budget? I'm not expecting y'all to teach me things from scratch or give long answers, but if you have any links I could check out, or even a single bit of improvement, I'd be grateful - just looking for some direction.
My current plan is to:
1- Upload the .csv files to my cloud storage from my hard drive
2- Get an EC2 instance with ~32 GB RAM and download everything there
3- readtable() all of the .csv files in a for loop
4- Convert the cell columns like {"True";"False";..;"True"} to 1s and 0s for all tables (which would make everything a double)
5- split doubles by their columns for faster access in the future
6- save all (column) doubles as .mat files with a simple filename convention
7- upload all .mat files back to my cloud storage
8- download them back to my hard drive
Note 1: I have relatively fast upload/download speeds, but my PC overheats, so I can't really split the files and read them locally without breaking something - hence the cloud + download idea, but I'm open to other suggestions.
Note 2: The 4th and 5th columns aren't always identical to each other, and the 7th and 8th aren't always all-"True" or all-"False" respectively. They're all random.
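For reference, steps 3-6 of the plan above can be sketched in plain MATLAB roughly like this (the folder names and the "only True/False values" detection rule are assumptions - adapt them to the actual column layout):

```matlab
% Sketch of steps 3-6; srcDir/outDir are placeholder folder names
srcDir = "csv_in";   % assumed location of the .csv files
outDir = "mat_out";  % assumed destination for the .mat files
if ~isfolder(outDir), mkdir(outDir); end

files = dir(fullfile(srcDir, "*.csv"));
for k = 1:numel(files)
    % step 3: read one file (strings are easier to test than cellstr)
    T = readtable(fullfile(files(k).folder, files(k).name), ...
        'TextType', 'string');

    % step 4: any column containing only "True"/"False" becomes 1s and 0s
    for v = 1:width(T)
        if isstring(T.(v)) && all(ismember(T.(v), ["True" "False"]))
            T.(v) = double(T.(v) == "True");
        end
    end

    % steps 5-6: one .mat per .csv, each column saved as its own variable,
    % so a single column can be loaded later with load(f, 'Var2')
    [~, base] = fileparts(files(k).name);
    S = table2struct(T, 'ToScalar', true);
    save(fullfile(outDir, base + ".mat"), '-struct', 'S');
end
```

If readtable itself is the bottleneck, passing it an import-options object (detectImportOptions with the variable types set explicitly via setvartype) avoids the per-file type-guessing pass and can speed things up noticeably.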
  4 Comments
Ive J on 26 Oct 2021
tall arrays built on datastores can be much faster than readtable when you're dealing with big data. Consider the following:
ds = tabularTextDatastore('sample.txt', 'TextType', 'string'); % handling strings is much more convenient than cell arrays of char
% do other modifications on the datastore
tt = tall(ds);
% do QC, filtering, etc. steps (you're safe, these deferred operations won't affect your RAM usage):
% e.g. convert a "True"/"False" column straight to logical:
tt.(7) = tt.(7) == "True"; % similarly for column 8
t = gather(tt); % now read the clean table into memory
% save to a mat file: by converting the table into a struct and saving that
% to a mat file, loading/accessing individual variables becomes easier/more
% efficient: e.g. when you need only the second variable, you can do
% Var2 = load("chunk1.mat", 'Var2');
s = table2struct(t, 'ToScalar', true);
save("chunk1.mat", '-struct', 's')
Can Atalay on 26 Oct 2021
Thanks a bunch! This will help me big time working through the bigger ones :)


Answers (0)

Release

R2021b
