How to sort and open up text files titled in a range of numbers
Hi Matlab Community,
I have hundreds, perhaps thousands, of text files from an experiment. These files are titled by the time they were captured. I was wondering if there is a method to capture groups of the files at certain points in time. Since the titles of each file were determined in microseconds, there is no pattern in how the files were titled. For example, I would like to open text files titled between uc#6_123000 and uc#6_321000.
Maybe there is an iterative process for this record keeping?
Thanks for the help!
Answers (4)
dpb
on 24 Jul 2014
Use the time stamp as it was generated to create a list and iterate over it. I presume there's a meaning for 123000 and 321000 above and that the letters and the other digit create a coarser grouping of some sort.
Now if there's no specific time interval between the two integer values above, your list will include files that don't exist; use exist to test first, or put the attempt to open the file in a try...catch block to just skip the non-existent ones.
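A minimal sketch of that idea, assuming the files are named like uc#6_NNNNNN.txt in the current folder (the .txt extension and the exact prefix are guesses based on the question, not something the poster confirmed):

```matlab
% Sketch: build a candidate name for every value in the range and keep
% only the names that actually exist on disk.
% ASSUMPTIONS: prefix 'uc#6_', extension '.txt', files in the current folder.
lo = 123000;  hi = 321000;
found = {};
for t = lo:hi
    fname = fullfile(pwd, sprintf('uc#6_%d.txt', t));
    if exist(fname, 'file') == 2      % 2 means an existing file
        found{end+1} = fname;         %#ok<AGROW>
    end
end
fprintf('%d of %d candidate names exist\n', numel(found), hi - lo + 1);
```

The loop touches every integer in the range, so it gets slow if the range is wide compared with the number of files.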
Michael Haderlein
on 24 Jul 2014
You can use the dir function to get all the files, then extract the time stamp and compare the time stamps of the files with a min/max value:
allfiles=dir('*.m');
d=[allfiles.datenum];
selectedfiles=allfiles(d>=datenum([2014 05 01 0 0 0]) & d<datenum([2014 06 01 0 0 0]));
This should give you all m-files in your folder from May of this year. For your text files, change the pattern to '*.txt'. Then you can start reading selectedfiles.
Best regards,
Michael
dpb
on 24 Jul 2014
datenum([735598.475717593])
Don't need datenum; 735598.475717593 is already a serial date number.
The expression returns a 0-sized structure because your lower limit is greater than the upper -- you asked for >735598+ and simultaneously <735597+. This is impossible.
selectedfiles=allfiles(d>=735597.813472222 & d<735598.475717593);
It would seem better to keep the actual date strings or y,m,d,... vector than the absolute numeric values, though.
Again, you'll have to parse the numeric values from the file names themselves as suggested earlier in order to pick individual specific files as the name string date and the file system date stamp don't correlate as you outlined in the original question.
dpb
on 25 Jul 2014
Edited: dpb
on 25 Jul 2014
datenum returns a Matlab serial date number for the given date/time input as a double. The whole number is the number of days since the reference point, the fractional part is fraction of day. It has precision of roughly msec, and the OS doesn't have better than that anyway so, yes, Virginia, you can't find a file to the microsecond that way.
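To make the whole/fractional split concrete, here is a small sketch using the serial date number quoted earlier in this thread (the variable names are only for illustration):

```matlab
% Sketch: split a serial date number into whole days and fraction of day.
dn      = 735598.475717593;     % value quoted earlier in the thread
days    = floor(dn);            % whole days since the reference point
dayfrac = dn - days;            % fraction of the day elapsed
secs    = dayfrac * 86400;      % the same fraction expressed in seconds
fprintf('%d days + %.9f day = %.3f s after midnight\n', days, dayfrac, secs);
disp(datestr(dn))               % human-readable form of the same instant
```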
datenums, however, do have a very well defined order but they will be returned from a call to datenum in the order of the dates presented. If you don't enter a chronological set of values, then the returned values won't be in chronological order, either. This behavior is fully consistent with other matrix operations in Matlab.
ADDENDUM
It is, however, true that the output of dir is generally sorted alphanumerically by name(*), not by file system date; hence the returned datenum value in the directory structure will not necessarily be sequential unless the alpha order and the date order coincide. As noted in later response, you can sort it to process them in order. That still doesn't get those within some preselected range of sample times, however, as also noted.
Just came to me what the likely cause of the order confusion is/was...
(*) This is still dependent on the OS default and any options in play on the particular platform.
ENDADDENDUM
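Sorting the listing by file system date could look something like the following sketch (the '*.txt' pattern is an assumption; use whatever wildcard matches the files of interest):

```matlab
% Sketch: reorder the output of dir by file-system date instead of by name.
d = dir('*.txt');               % ASSUMPTION: text files in the current folder
[~, ix] = sort([d.datenum]);    % ascending modification time
d = d(ix);                      % reorder the struct array to match
for k = 1:numel(d)
    fprintf('%s  %s\n', datestr(d(k).datenum), d(k).name);
end
```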
I don't quite follow the naming convention other than the last 9(?) digits--is the c#6_6_1_3_9_ portion a month/day pattern or somesuch?
Again, if you want a set of files within some range of microseconds with a given one of these preceding patterns, you'll need to separate out those microsecond values from the file names and then operate on them to glean out those within some range. That separation process is, indeed, "parsing".
If the file names were as the two examples above, if one were to use dir on the directory as
d=dir('c#6_6*.txt');
one would get a return that would look for the name field something like--
>> d.name
ans =
'c#6_6_1_3_9_472319296.txt'
ans =
'c#6_6_1_2_100_483853850.txt'
>>
From there it's not too tough to get the values for all of the various fields--defining a function handle that can parse the names to numeric values as
f=@(x) cell2mat(textscan(char(x),['c#' repmat('%d_',1,5) '%9d.txt'],'collectoutput',1));
where the input x is the name for each file, we can apply that to each entry in the directory with
>> dtvals=reshape(cell2mat(arrayfun(f,{d(:).name},'uniformoutput',false)),6,[]).'
dtvals =
6 6 1 3 9 472319296
6 6 1 2 100 483853850
Now from this you can use
isOK=iswithin(dtvals(:,6),lo,hi);
to return values within a lo and hi range of microsec's.
iswithin is a helper utility function of mine that looks like
function flg=iswithin(x,lo,hi)
% returns T for values within range of input
% SYNTAX:
% [log] = iswithin(x,lo,hi)
% returns T for x between lo and hi values, inclusive
flg= (x>=lo) & (x<=hi);
It's just "syntactic sugar" but it moves the complexity of the actual comparisons to a lower level for ease in reading the top-level code.
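Putting the pieces together, a rough end-to-end sketch might look like this. Note that it swaps in regexp for the textscan parsing above and an anonymous function for iswithin.m so that it runs as a single script; the lo/hi values are made up for illustration, and names would normally come from {d.name}:

```matlab
% Sketch: pick files whose trailing microsecond field lies in [lo, hi].
names = {'c#6_6_1_3_9_472319296.txt', 'c#6_6_1_2_100_483853850.txt'};
iswithin = @(x,lo,hi) (x>=lo) & (x<=hi);   % stand-in for iswithin.m
% Capture the last run of digits before '.txt' (regexp, not textscan)
tok  = regexp(names, '_(\d+)\.txt$', 'tokens', 'once');
usec = cellfun(@(c) str2double(c{1}), tok);
lo = 470000000;  hi = 480000000;           % hypothetical range
isOK = iswithin(usec, lo, hi);
selected = names(isOK);
for k = 1:numel(selected)
    fid = fopen(selected{k}, 'r');
    if fid < 0, continue, end              % skip files that cannot be opened
    txt = fread(fid, '*char').';           % whole file as one char row
    fclose(fid);
    % ... process txt here ...
end
```

With the two example names, only the first one falls inside the made-up range.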
2 Comments
dpb
on 25 Jul 2014
Edited: dpb
on 25 Jul 2014
Yes, the bug is that the code for the function iswithin must, like all other functions, reside in an m-file named iswithin.m
The error is telling you you can't define a function in a script file or at the command line.
On the question of approach -- if there is a header inside the file as well that has the time, you could certainly read it, too. You wouldn't need to rename the files to do so, simply iterate over the returned directory structure. Just return the directory structure from dir for files that have the other suitable characteristics of the desired type of test and/or channels via the wild card name in the search. If, despite the microsecond resolution in the file naming convention, there's at least a few milliseconds between the different file creation dates, you could then sort the returned names on the datenum field and process in the sorted order. Then, as you say, you can simply open each in sequence, check the date from the header and accept/reject based on the desired timestamp range.
How inefficient this will be depends entirely on just how many files there are as compared to the number desired to be processed and the size of the files. If the header is short and easily parsed, it shouldn't be too bad to simply read a line, but it is another step that could be avoided by the above logic to select the files from the name timestamps.
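The header layout wasn't posted, so the following is only a sketch under a loud assumption: that the first line of each file holds nothing but the numeric timestamp. The '*.txt' pattern and the lo/hi values are likewise placeholders:

```matlab
% Sketch: accept/reject files by a timestamp read from the first line.
% ASSUMPTION: line 1 of each file is a single number (the timestamp).
d  = dir('*.txt');                  % hypothetical wildcard
lo = 123000;  hi = 321000;          % range from the original question
keep = false(size(d));
for k = 1:numel(d)
    fid = fopen(d(k).name, 'r');
    if fid < 0, continue, end       % unreadable file, leave keep(k) false
    hdr = fgetl(fid);               % read the first line only
    fclose(fid);
    t = str2double(hdr);            % NaN if the line is not a plain number
    keep(k) = t >= lo && t <= hi;   % comparisons with NaN are false
end
d = d(keep);                        % files whose header time is in range
```

A real header will almost certainly need its own parsing in place of str2double.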
The original idea of creating a list of timestamps within the range is admittedly inefficient, as there are likely far more possible timestamps in the range than files, and since the timestamp can be any value you can't skip by twos or some other difference but must check every one. OTOH, it has the advantage over the above that it would check whether the file exists or not before trying to open/read it, so that shouldn't add too much overhead even though the loop could be sizable.
Sorting on the datenum field would fail to keep the tests sequential in processing only if there are duplicate timestamps, that is, if it were possible for the system to write subsequent files within the resolution of the OS clock. I doubt this would happen, but I suppose it theoretically could if the system were multi-core or otherwise cached its writes. If that does occur, then the filename parsing is certainly easier than sorting those out.
It seems to me that with the above to parse the filename timestamps you should be on your way ... the error you had is, as noted, simply that you didn't put the utility function in its own file. As an aside, I suggest creating a directory for such code--mine is called 'utilities' and has a plethora of these little snippets. Just create the new subdirectory and add it to the matlabpath so it will be accessible.
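Setting up such a folder could go roughly like this; the location under the current folder is only an example, so put it wherever suits:

```matlab
% Sketch: create a personal utilities folder and put it on the MATLAB path.
% ASSUMPTION: the folder lives under the current directory.
utilDir = fullfile(pwd, 'utilities');
if ~exist(utilDir, 'dir')
    mkdir(utilDir);             % create it the first time through
end
addpath(utilDir);               % visible for the current session
% savepath                      % uncomment to keep it for future sessions
% Save iswithin.m into utilDir and it can then be called from anywhere.
```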