Subset variable data in one .csv based on values from same variable in a second .csv

I have a large .csv file (~7GB) where the first column is filled with datetimes, and the remaining columns filled with data corresponding to that datetime.
e.g.:
20191025241533, 65.2, 35.2, 75.4
20200122061522, 32.2, 74.5, 61.3
20190411075120, 54.1, 87.0, 22.1
I then have a second smaller .csv which contains a list of datetimes which are of interest to me.
e.g.
20191025241533
20191025241533
I would like to subset the first large .csv based on the datetime values in the second, so I would only keep rows with datetimes present in the second .csv.
Secondly, I would like to use the datetime values in the second smaller .csv to extract files (where the filename is datetime) from a larger folder of files into a separate folder which contains only those files that have the same filename (datetime) as those recorded in the smaller .csv.
Thanks for your help!

11 Comments

If your PC has a sufficient memory to read the large .csv file, the following steps should work:
  1. By using readtimetable function, read the large .csv file and store the data in timetable variable
  2. Read the small .csv file and store as datetime variable
  3. Apply ismember or withtol function to find the corresponding rows in the large .csv file
  4. Extract the corresponding rows and save as .csv file by writetimetable function
If the timestamps in the seconds file are exactly the timestamps you are looking for in the first file, you can just subscript into the larger timetable using those datetimes. Something like
TTinteresting = TTbig(interestingTimes,:)
But with 7GB, you may need to resort to using a tall timetable, or perhaps datastores.
The thing with moving the files is just a matter of string and file manipulation.
My computer has enough memory to read the file as a timetable but not to let me look at it beyond a certain number of rows. Could I do this using csvread instead?
They timestamps in either file are indeed exactly the same so I will try this first.
Thank you!
csvread is not going to help. You will get a numeric matrix with very large laues such as 20191025241533, which will not be easy to work with. It's not clear what you mean by, "not to let me look at it beyond a certain number of rows".
Subscript by time to get a sub-timetable.
Hi Peter,
When I use csvread() on the large .csv I get this:
When I use readtable() I get this:
But I can preview the data produced by readtable() in the command window (following conversion of the first variable to datetime) buy using head(datatable):
The second smaller csv looks like this:
-this file contains what looks like numbers but are actual datetime in format yymmddHHMMSS.
I want to use the datetime values in the second .csv to extract rows of data from the first csv where the datetimes are the same.
I have converted these using:
vessels.Var1=datetime(vessels.Var1, 'InputFormat', 'yyMMddHHmmss');
so now I have:
So, now I think I can use is.member to match values from the second smaller .csv which occur in the large .csv and extract the associated data with those times from the larger .csv.
Does this all sound like a good approach?
@Akira Agata
-saving as timetable and datetime variables works if you use withtol but not ismember, I can't get the ismember approach to work but the withtol does exactly what I was aiming for. Thank you! Do you know how I would select for the variables which don't match the datetime variable, i.e. what would be left in the original tt after removing the rows which correspond to the datetime? What matches is of similar interest as to what doesn't match.
I considered something like ~ismember but this didn't work when using withtol.
Instead I have done this:
S=withtol(smalltable.DateTime,seconds(5)); %(rowTimes, tol=tolerance for matching times)
matching_times=large_tt(S,:); %subset by S (rows which match times in small table) and all columns
not_matching=large_tt; %copy timetable
not_matching(S,:)=[]; %remove rows where times matched
There are several different things going on. You need to solve one at a time.
Again, csvread is not going to help. Don't use it.
The variable editor will not display such a large (7GB) table. That doesn't matter; just use head, or tail, or row subscripting to convince yourself that you have read the data correctly.
Which, it turns out, you haven't. Apparently, the first row of your file contains 0:N. You need to skip that row using HeaderRows.
It's not clear what's really in your files. You descibe the large one as having things like 20191025241533 in the first column, yet in one of your screenshots, it seems that you have (or have converted to) datenums. Don't use datenums. Another screenshot of the large file indicates that you've successfully read the file and have datetimes in your table, that's good. Another screenshot suggests that your small file contains text timestamps?. You seem to have converted those to datetime, that's good, and maybe that's what you did for the large file too. At any rate, it looks like you have a large table and a small table, each of which has a datetime variable.
So, are you not done except for using ismember? If your timestamps match exactly, that's all you need to do:
i = ismember(largeTable.Time,smallTable.Time);
shortenedTable = largeTable(i,:);
If they don't match exactly, then you need to use withtol:
wt = withtol(smallTable.Time,someTolerance);
shortenedTable = largeTable(wt,:);
I don't know what the tolerance would be. Your screenshots indicate that your timestamps are to a resolution of seconds; that suggests that you don't need a tolerance, and should be using ismember.
Hi Peter,
Thank you. Yep, I have used withtol as it seems there is a slight difference in the datetimes.
I will use readtable from now on and I'm using head() to preview.
The large .csv is a spreadsheet which contains frequency values in the first row (1-7200Hz) and datenums in the first column. These datenums once converted to datetime correspond to the filenames of different .wav files e.g.yymmddHHMMss.wav
I wrote these files using datenum as the time vector as this is what a previous colleague had done and I've been building on their code. These spreadsheet is the output of an acoustic analysis which takes days to write so I'm afraid I can't go back and re-do them all as it would be so time consuming. So I have to work with this datenum in the first column. I converted the datenum to datetime and then changed the table to a timetable.
The small file contains the .wav file names in the first column and these are in text form as you say but correspond to datetimes.
Why do you advise not to use datenum?
I had used withtol and a tolderance of 5 seconds which got me a believable amount of matches. Using ismember I get a much smaller number of matches so it looks like there is a bit of a difference, something I'll investigate as I'd expect them to be the same.
I have just realised that the reason they don't match is because when I convert from datenum to datetime, the timing is off by 1 second. Is there a way to fix that?
They are almost certainly not off by 1 sec, but rather off by a small negative amount, and the display is off. Add .SSS to the3 display format and you will see. Because datenum counts in "days", they incur round-off for almost every value that is not a whole day, or at some "nice" whole hour. It's that round-off that is causing this. datetime tries to account for that, but it can't possibly be perfect.
So ... this is one of the reasons why not to use datenums. datetime has been around since R2014b. It's a better choice. If you have no control over the file, and you know the time stamps are to a resolution of whole seconds, use dateshift to round your datetimes.

Sign in to comment.

Answers (0)

Categories

Products

Release

R2019a

Asked:

on 20 Apr 2020

Commented:

on 13 May 2020

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!