Faster ways to deal with bigger data (1 to 10 TB ish)
There are a few thousand large .csv files (each 8 GB max.) that I absolutely have to read top to bottom to do basic operations on them (they're on my hard drive; see the attachment to get an idea of what's in them). I want to convert them to .mat files after reading them with readtable(), but reading them takes days, and I need them fast. Could you help optimize my plan for converting them to a more manageable format via MATLAB in a short time, on my ~30 USD budget? I'm not expecting y'all to teach me things from scratch or give long answers, but if you have any links I could check out, or even a single bit of improvement, I'd be grateful. I'm just looking for some direction.
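One thing that often speeds up `readtable()` considerably is detecting the import options once and reusing them for every file, instead of letting MATLAB re-scan each 8 GB CSV to guess column types. A minimal sketch, assuming all files share the same schema (the folder path here is a placeholder, not from the original post):

```matlab
% Sketch: detect the CSV format once, then reuse it for every file.
% 'D:\data' is an assumed location - substitute your own folder.
files = dir(fullfile('D:\data', '*.csv'));
opts  = detectImportOptions(fullfile(files(1).folder, files(1).name));

for k = 1:numel(files)
    f = fullfile(files(k).folder, files(k).name);
    T = readtable(f, opts);   % skips per-file format detection
    % ... process T here ...
end
```

If the files don't all share one layout, `detectImportOptions` would need to run per file, but reusing a single options object is usually the first thing to try.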
My current plan is to:
1- Upload the .csv files to my cloud storage from my hard drive
2- Get an EC2 instance with ~32 GB RAM and download everything there
3- readtable() all of the .csv files in a for loop
4- Convert the cell columns to 1s and 0s for all tables (which would make everything a double)
5- Split the doubles by their columns for faster access in the future
6- Save all (column) doubles as .mat files with a simple filename convention
7- Upload all .mat files back to my cloud storage
8- Download them back to my hard drive
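Steps 3-6 could be sketched roughly as below. This is only an illustration under assumptions: the folder paths are placeholders, and the idea that cell columns hold the text 'true'/'false' is a guess from the description, so the `strcmp` test would need adjusting to the real values. Note that variables larger than 2 GB require the `-v7.3` MAT-file format:

```matlab
% Sketch of steps 3-6: read each CSV, map the cell (text) columns to
% 0/1 doubles, and save each column as its own .mat file.
% Paths and the 'true' encoding are assumptions, not from the actual files.
files  = dir(fullfile('D:\data', '*.csv'));   % assumed input folder
outDir = 'D:\mat';                            % assumed output folder

for k = 1:numel(files)
    T = readtable(fullfile(files(k).folder, files(k).name));
    [~, base] = fileparts(files(k).name);
    for c = 1:width(T)
        col = T.(c);
        if iscell(col)
            % Assumed: text columns contain 'true'/'false' strings.
            col = double(strcmp(col, 'true'));
        end
        name = T.Properties.VariableNames{c};
        % One .mat file per column, named <csvname>_<column>.mat;
        % -v7.3 is required once a saved variable exceeds 2 GB.
        save(fullfile(outDir, sprintf('%s_%s.mat', base, name)), ...
             'col', '-v7.3');
    end
end
```

Saving on the EC2 instance and only downloading the compact .mat files, as in steps 7-8, keeps the heavy I/O off the overheating PC.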
Note 1: I have relatively fast upload/download speeds, but my PC overheats, so I can't really split the files and read them locally without breaking something - hence the cloud + download idea, but I'm open to other suggestions.
Note 2: The 4th and 5th columns aren't always identical to each other, and the 7th and 8th aren't always true or always false, respectively. Their values vary from row to row.