Subset variable data in one .csv based on values from same variable in a second .csv
Show older comments
I have a large .csv file (~7GB) where the first column is filled with datetimes, and the remaining columns filled with data corresponding to that datetime.
e.g.:
20191025241533, 65.2, 35.2, 75.4
20200122061522, 32.2, 74.5, 61.3
20190411075120, 54.1, 87.0, 22.1
I then have a second smaller .csv which contains a list of datetimes which are of interest to me.
e.g.
20191025241533
20191025241533
I would like to subset the first large .csv based on the datetime values in the second, so I would only keep rows with datetimes present in the second .csv.
Secondly, I would like to use the datetime values in the second smaller .csv to extract files (where the filename is datetime) from a larger folder of files into a separate folder which contains only those files that have the same filename (datetime) as those recorded in the smaller .csv.
Thanks for your help!
11 Comments
Akira Agata
on 21 Apr 2020
If your PC has a sufficient memory to read the large .csv file, the following steps should work:
- By using readtimetable function, read the large .csv file and store the data in timetable variable
- Read the small .csv file and store as datetime variable
- Apply ismember or withtol function to find the corresponding rows in the large .csv file
- Extract the corresponding rows and save as .csv file by writetimetable function
Peter Perkins
on 27 Apr 2020
If the timestamps in the seconds file are exactly the timestamps you are looking for in the first file, you can just subscript into the larger timetable using those datetimes. Something like
TTinteresting = TTbig(interestingTimes,:)
But with 7GB, you may need to resort to using a tall timetable, or perhaps datastores.
The thing with moving the files is just a matter of string and file manipulation.
Louise Wilson
on 28 Apr 2020
Edited: Louise Wilson
on 28 Apr 2020
Peter Perkins
on 30 Apr 2020
csvread is not going to help. You will get a numeric matrix with very large laues such as 20191025241533, which will not be easy to work with. It's not clear what you mean by, "not to let me look at it beyond a certain number of rows".
Subscript by time to get a sub-timetable.
Louise Wilson
on 30 Apr 2020
Edited: Louise Wilson
on 30 Apr 2020
Louise Wilson
on 1 May 2020
Edited: Louise Wilson
on 1 May 2020
Louise Wilson
on 1 May 2020
Peter Perkins
on 5 May 2020
There are several different things going on. You need to solve one at a time.
Again, csvread is not going to help. Don't use it.
The variable editor will not display such a large (7GB) table. That doesn't matter; just use head, or tail, or row subscripting to convince yourself that you have read the data correctly.
Which, it turns out, you haven't. Apparently, the first row of your file contains 0:N. You need to skip that row using HeaderRows.
It's not clear what's really in your files. You descibe the large one as having things like 20191025241533 in the first column, yet in one of your screenshots, it seems that you have (or have converted to) datenums. Don't use datenums. Another screenshot of the large file indicates that you've successfully read the file and have datetimes in your table, that's good. Another screenshot suggests that your small file contains text timestamps?. You seem to have converted those to datetime, that's good, and maybe that's what you did for the large file too. At any rate, it looks like you have a large table and a small table, each of which has a datetime variable.
So, are you not done except for using ismember? If your timestamps match exactly, that's all you need to do:
i = ismember(largeTable.Time,smallTable.Time);
shortenedTable = largeTable(i,:);
If they don't match exactly, then you need to use withtol:
wt = withtol(smallTable.Time,someTolerance);
shortenedTable = largeTable(wt,:);
I don't know what the tolerance would be. Your screenshots indicate that your timestamps are to a resolution of seconds; that suggests that you don't need a tolerance, and should be using ismember.
Louise Wilson
on 6 May 2020
Louise Wilson
on 6 May 2020
Peter Perkins
on 13 May 2020
They are almost certainly not off by 1 sec, but rather off by a small negative amount, and the display is off. Add .SSS to the3 display format and you will see. Because datenum counts in "days", they incur round-off for almost every value that is not a whole day, or at some "nice" whole hour. It's that round-off that is causing this. datetime tries to account for that, but it can't possibly be perfect.
So ... this is one of the reasons why not to use datenums. datetime has been around since R2014b. It's a better choice. If you have no control over the file, and you know the time stamps are to a resolution of whole seconds, use dateshift to round your datetimes.
Answers (0)
Categories
Find more on Dates and Time in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!



