I have Matlab R2012b and am importing a rather large .csv file created by another tool. The only column I care about is the one with all of the timestamps, so I've imported that column as text, but it ends up as a cell array of strings. Since I need to convert the timestamps to Seconds of Day to find out if there are any gaps in my data, I have done the following:
filename = 'myfilename.csv';
delimiter = ',';
startRow = 2; %it always starts at row 2 as row 1 is the header name
formatIn = 'HH:MM:SS.FFF'; % defines my time format for the datevec function
formatSpec = '%*s%*s%*s%*s%*s%*s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s%[^\n\r]';
fileID = fopen(filename,'r');
textscan(fileID, '%[^\n\r]', startRow-1, 'ReturnOnError', false); %reads my file
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'EmptyValue' ,NaN,'ReturnOnError', false); %creates my data array
fclose(fileID);
Time = dataArray{1, 1}; %outputs the column I want from the .csv file
Time1 = num2str(cell2mat(Time)); % where Time is my input data
TimenoDay = Time1(5:end); % This is supposed to remove the 246 and space leaving me with 14:32:54.342 231
TimenoSpaces = TimenoDay(~isspace(TimenoDay)); % This is supposed to remove all spaces leaving me with 14:32:54.342231
[Y M D H MN S] = datevec(TimenoSpaces,formatIn); % this is supposed to break the time into a vector array
SOD = (H*3600+MN*60+S); % this should convert all of my time stamps to Seconds of Day
It works fine till I get to the datevec function, which runs very slow, as in 30+ minutes. Granted, it is ~5 million rows of data, but I did something similar yesterday and it only took a few minutes to run datevec on a similar amount of data. The only difference between the two scripts is that the one that ran quickly had a normal time format I could import as a Time format (14:42:36.243213) instead of as text with the script above. I think my problem is the Julian day and the extra spaces in the timestamp.
Does anyone have any ideas that could help me speed this up? Make my code better?
Edit#1: So I see why the datevec isn't working. When I run
Time1 = num2str(cell2mat(Time))
it returns a 20 character cell with all 2's in it. But if I do
Time1 = num2str(cell2mat(Time(1:1)))
and just look at the first cell of data it works fine.

2 Comments

dpb
dpb on 9 Dec 2015
What is the format of the input file? If all you want is the time data I'd just read the date data directly and convert it saving a step. Precise code would depend on the format...
It's a .csv file with 5,272,494 rows spanning columns A - AQ. The only column I care about (for this check, anyway) is column G, which gives me a timestamp of DDD HH:MM:SS.FFF sss (milliseconds, I believe). All I need for the check I'm performing is the HH:MM:SS.FFF sss, without the space, so I want my end format to look like HH:MM:SS.FFFsss. I need it in a string that I can then run through datevec so I can separate it into H, M, S and convert to Seconds of Day. This will allow me to look for gaps in my data greater than 90 milliseconds.
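Just to pin down the arithmetic described above, here is a rough sketch of that conversion as a scratch Python snippet (the helper name and sample stamp are mine, purely for illustration; the real work stays in MATLAB):

```python
# Hypothetical helper: convert a "DDD HH:MM:SS.FFF sss" stamp to seconds of day.
def stamp_to_sod(stamp):
    day, clock, tail = stamp.split()      # e.g. '246', '18:27:08.152', '160'
    h, m, s = clock.split(":")            # hours, minutes, fractional seconds
    # The trailing field extends the .FFF fraction down to microseconds.
    return int(h) * 3600 + int(m) * 60 + float(s) + int(tail) / 1e6

sod = stamp_to_sod("246 18:27:08.152 160")   # day-of-year 246 is discarded here
```

A gap check is then just successive differences of these seconds-of-day values.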


 Accepted Answer

dpb
dpb on 9 Dec 2015
Edited: dpb on 10 Dec 2015
Again, pasting a (partial even) row of text with the actual field of interest would relieve confusion regarding the form, but presuming the 'sss' is actually microseconds (since the .FFF includes down to the msec resolution) and is in the file, then I'd do something like
fmt=[repmat('%*s',1,'G'-'A') '%3f %2f:%2f:%f %*[^\n]'];
d=cell2mat(textscan(fid,fmt,'delimiter',',','headerlines',1,'collectoutput',1));
Now you'll have an Nx4 array of the [day,hr,min,sec.fff] values, and you can pass those directly to datenum with the leading yr,mo columns as zero or some arbitrary start date (it all gets subtracted out in the end, so the absolute value is immaterial, and the time fields are all stored in the fractional portion of the datenum value). Keeping the day will eliminate issues regarding clock rollover if any data were collected over the bewitching midnight hour.
Then, simply looking for
diff(datenum([YR0 1 d])) > Setpoint
will give you the desired answer. The above should be pretty quick.
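The rollover point deserves a concrete illustration. This is not the answer's MATLAB code, just a hedged Python sketch of the same idea with made-up values: fold the day-of-year into the timestamp against an arbitrary reference year, and midnight crossings take care of themselves.

```python
from datetime import datetime, timedelta

# Fold day-of-year into the timestamp; any reference year works, since it
# subtracts out when differencing (all values here are hypothetical).
def to_time(day, h, m, s):
    return datetime(2000, 1, 1) + timedelta(days=day - 1, hours=h, minutes=m, seconds=s)

t1 = to_time(246, 23, 59, 59.995)           # just before midnight
t2 = to_time(247, 0, 0, 0.015)              # just after, on the next day
gap_ms = (t2 - t1).total_seconds() * 1000   # 20 ms, despite the day rollover
```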
Again, all above is air-code w/o an example subset of data to 'spearmint with..."salt to suit" :)
ADDENDUM
OK, with the sample data file to check on formatting, the rest is pretty much as outlined. You don't have any calendar data other than DOY and Matlab datenum needs a reference year so just pick something arbitrary as you're only looking for differences so it all gets subtracted out again, anyway.
dn=datenum(2000,0,dat(:,1),dat(:,2),dat(:,3),dat(:,4)+dat(:,5)/1E6);
dt=diff(dn)*86400*1000;
Above takes the date information you have and creates a datenum sequence. As noted before, this will have the benefit of handling clock rollover, etc., automagically. Then the "delta-t" dt is the sample time differential in msec which you can test. For a sample subset of the file supplied, this looks like--
>> dt(10:40)
ans =
0.0905
0.4526
0.7343
0.6940
0.6840
0.6940
0.6940
0.6940
10.2494
0.3722
0.3722
0.6940
0.5532
0.6840
0.1308
0.3017
0.6940
0.5834
0.2112
0.1308
0.4124
14.8058
0.3722
0.3722
0.6940
0.5532
0.6840
0.1408
0.2917
0.1509
0.3923
>>
Seems like the acquisition must be on a time-sharing system where there's a periodic latency, maybe???
Anyway, your end object is now simply to find the locations that don't satisfy your criterion. There are several choices for how to report this; you can use a logical vector of locations that satisfy, find locations that don't, report if any don't or all do, etc., etc., etc., ...
>> isOK=all(dt<90)
isOK =
1
>> sum(dt>10)
ans =
4
>> find(dt>90)
ans =
Empty matrix: 0-by-1
>> find(dt>10)
ans =
18
31
53
61
>>
Above uses a couple different fixed setpoints; you'll make that a variable, of course. I used an example of 10 just to illustrate the case of a few failing. The logical-vector solution returns a full-length logical array of T/F, of course. Which turns out easiest to use is all dependent upon end objectives.
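For what it's worth, the same reporting choices translate directly to other languages; here is a small Python sketch with made-up dt values (in msec):

```python
# Made-up millisecond differences, standing in for dt = diff(dn)*86400*1000.
dt = [0.5, 10.2, 0.7, 14.8, 0.4]
setpoint = 90.0

is_ok = all(d < setpoint for d in dt)                   # any gap over the limit?
n_over_10 = sum(d > 10 for d in dt)                     # how many exceed 10 msec
idx_over_10 = [i for i, d in enumerate(dt) if d > 10]   # and where they are
```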

16 Comments

dpb, thanks for the help and I'm sorry for not responding sooner. I got pulled away on another project yesterday afternoon.
I have a screenshot of my input sheet that hopefully will answer your question about my input file as well as the addition of the code you gave me and what its output looks like.
My apologies in advance for any ignorance I may show. This is my first adventure into coding since PHP and CSS in college 15 years ago, and I'm teaching myself as I go.
I was trying to understand how to pass my new Nx4 array into datevec, but every time I tried it I was getting 6 columns as expected, only it was splitting my times up weirdly: the day was being split across the year and month, the hour across the day and hours, and the minutes between the hours and minutes. So I'm not sure what I was doing wrong.
My input file:
My code and output variables:
Well, that didn't work like I thought it would. Try this link to download the files.
dpb
dpb on 10 Dec 2015
Please use cut 'n paste of text from the command line and/or editor, not image. We old fogies don't have the eyes any longer and can't do anything with the data when displayed that way, anyway.
It doesn't need but a few lines of the file at most; the rest is nothing but volume; just like adding 9-digit numbers in junior high didn't add anything to the principles learned with 2 or 3 columns in elementary school; it's just volume, not content.
Your link is also just to images; use the paperclip icon to attach a file in the forum or, again, as suggested, just cut 'n paste a few lines of it from the editor and same with your code.
Sorry. I'm working between two computer systems here. I'll get a sample file of the input file to link and I'll paste my code into a sheet I can copy over and then copy and paste for you. Apologies. I didn't think about that being an issue.
OK, here are my sample files. And again, I'm sorry, I should have thought about that instead of images, I've been around forums long enough not to make that stupid of a mistake.
dpb
dpb on 10 Dec 2015
Thanks but "Houston, we still have a problem!" Unfortunately, Firefox can't handle however it is that TMW has encoded their links to non-text files; they're opened or saved as text which is, of course, not a readable Excel file.
That's not your fault, it's either TMW's or Firefox's; I've yet to determine just who's to blame.
Either way, an Excel file isn't a .csv file, which your original posting indicated it was/is... If that's the case, you'll need to either convert to an actual text .csv file or, if you keep the Excel format, use xlsread, in which case I think you're going to get character data from that string, as I don't believe xlsread will be able to interpret the time format as written natively.
I'd suggest you again revert to text, and I'll raise the question (yet again) re: the file links in the Forum, as 'tis a bother indeed, and this isn't the only time/case.
/bangs head slowly on desk.
Yes. It's supposed to be a .csv file. When I copied the snippet over and saved it, guess what the default format is on my computer.... Yeah, .xlsx. Oh well. I'll stop trying to be fancy and do it the way you've asked repeatedly.
Bus Name Bus ID CMD Type RT Bus A/B Sub Addr Time CMD-1 CMD-2 Tx Status Rcv Status Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 Word 11 Word 12 Word 13 Word 14 Word 15 Word 16 Word 17 Word 18 Word 19 Word 20 Word 21 Word 22 Word 23 Word 24 Word 25 Word 26 Word 27 Word 28 Word 29 Word 30 Word 31 Word 32
1553 18 RT to BC 10 B 1 246 18:27:08.152 160 5420 5000 fcff 26 b93e ff83 a 20 d5 187 43 fff6 ff9b fc0d da01 1b37 dcf2 c991 50a 198 ffff 28c4 ffff ea45 0 9a5 3ff 3977 b9 c16 dd9 5717 6cde 95c1
That is the first line of the input file and it truly is a .csv file.
filename = 'D:\MatLab Files\Import Files\mydatafile.csv';
delimiter = ',';
startRow = 2;
formatSpec = [repmat('%*s',1,'G'-'A') '%3f %2f:%2f:%f %*[^\n]'];
fileID = fopen(filename,'r');
dataArray = cell2mat(textscan(fileID, formatSpec, 'Delimiter', delimiter, 'EmptyValue' ,NaN,'HeaderLines' ,startRow-1, 'ReturnOnError', false));
fclose(fileID);
And that is the code I'm using to import the file.
I hope this helps. And thank you for taking the time to walk through this with me. It's much appreciated.
And just in case it works this time as well, I've uploaded the snippet as the .csv file it should have been.
OK, that works...I see a couple things. First, use 'collectoutput',true to return the data array as an array. Secondly, looks like you do need the remaining precision of the time field that I had ignored thinking you had enough precision in the first seconds and fractional thereof. That modifies the format to
fmt=[repmat('%*s',1,'G'-'A') '%3f %2f:%2f:%f %f %*[^\n]'];
which is simply adding one more scanned field before skipping the rest of the line. With that and the aforementioned change, I get
>> dat=cell2mat(textscan(fid, fmt,'Delimiter',',', ...
'HeaderLines',1, 'collectoutput',1));
>> whos dat
Name Size Bytes Class Attributes
dat 67x5 2680 double
>>
>> dat
dat =
246.0000 18.0000 27.0000 8.1520 160.0000
246.0000 18.0000 27.0000 8.1520 852.0000
246.0000 18.0000 27.0000 8.1520 984.0000
246.0000 18.0000 27.0000 8.1530 278.0000
...
246.0000 18.0000 27.0000 8.2300 961.0000
246.0000 18.0000 27.0000 8.2310 652.0000
246.0000 18.0000 27.0000 8.2320 204.0000
246.0000 18.0000 27.0000 8.2320 895.0000
>> num2str(dat(end,4)+dat(end,5)/1E6)
ans =
8.232895
>>
So, reading the data is easy; what again now is it that you're after, precisely?
I'm building a reusable gap-checking script, so that anyone could come along, take our data, prepare the input file, and run the script. The script's end result is several graphs, but its main intent is to look for gaps in the data. Meaning from one timestamp to the next, if there is a gap greater than 90 ms, I want to know.
So I want to take that timestamp and make it into a seconds-of-day format so I can find the difference and convert the difference to ms easily (this is me coming from Excel, where it's never easy). I'm expecting to see a difference of 18-22 ms. Later on, I have some functions that tell me the max gap size, the average gap size, the minimum gap size, and graph the total number of gaps.
I think that covers it.
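Those statistics are essentially one-liners once the gaps are in hand. A hedged sketch (in Python, with made-up timestamps; not the poster's actual script):

```python
# Hypothetical seconds-of-day values; gaps are successive differences in msec.
sod = [66428.152, 66428.172, 66428.292, 66428.312]
gaps_ms = [(b - a) * 1000 for a, b in zip(sod, sod[1:])]

max_gap = max(gaps_ms)                    # largest gap
min_gap = min(gaps_ms)                    # smallest gap
avg_gap = sum(gaps_ms) / len(gaps_ms)     # average gap
n_bad = sum(g > 90 for g in gaps_ms)      # gaps over the 90 ms criterion
```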
dpb
dpb on 10 Dec 2015
OK, see addendum to previous answer. In Matlab it's essentially trivial.
Thank you. I haven't had a chance to check all this out yet, meetings suck some times. I'll look it over tomorrow.
Two problems I'm running into.
1) My dataArray output is giving me the 5 columns I expect to see; however, the issue is when a value for the seconds is a whole number, i.e. 38 vs. 37.999.
246 0 0 37.9640 646 ....
246 0 0 37.9990 982 ...
246 0 0 38 493
This is causing a huge jump in my differences as it appears that when it combines the seconds/milliseconds to the microseconds, it makes it look like
0:0:37.9640646
0:0:37.9990982
0:0:38493
What I think I need the formatspec to do for me is to always give me the 3rd column out to 3 digits past the decimal, not four, even if that means I see
38.000
Making the entire row look like
246 0 0 38.000 493
for a time of
0:0:38.000493
I tried doing
formatspec = [repmat('%*s',1,'G'-'A') '%3f %2f:%2f%.3f %f %*[^\n]']
but it only ended up truncating the last column of 3 digits down to 1 digit and left the 4th column with 2.xxxx.
2) I forgot about the time it takes for the data to sync to the GPS clock meaning I have a few thousand rows with a time of zero hours that I need to have the script dump out. The closest I can get is
>>Time(strncmp(Time,'00',2))=[]
but that would only work if it was a string and the first 2 characters were 00. Any ideas on how to do something similar for the data matrix Nx5?
OK, I figured out how to get rid of the rows with an hour of 00
Timestamps = dataArray(logical(dataArray(:,3)),:);
but I'm still struggling with the microseconds conversion. I think the divide by 1E6 in the datenum function is supposed to do that, but I'm not convinced it is. Sadly R2012b doesn't allow me to convert from datenum back to a datetime string that I can find so I can't verify that the datenum string matches up to the source. Unless I'm missing something.
Ok, scratch the part about microseconds. It's what happens when I perform
dt=diff(dn)*86400*1000;
that has me scratching my head. I'm expecting to see values between 18 and 22 (millisecond values in whole numbers) but I'm seeing
0.1509
0.0905
0.4526
0.7343
0.6940
10.1790
I've tried doing
diff(dn)*86400
diff(dn)*1000
and several other variations and it always looks the same. So I'm sure Matlab is doing the math right, but it's not what I expect to see and it's confusing me.
/slowly bangs head into desk over and over
I'm confusing my gaps. Sigh. Yes, it is doing it right. I had to step away and stop being frustrated. Let me finish running my entire script and compare the output to the output of the tool I'm replacing, and I think all shall be good.
thanks for putting up with my ramblings this morning.
Ayup, that's what the dataset you posted shows, anyway. One of about 16 msec is the largest in that particular segment of the file (which is, of course, relatively short, so I don't know what max(dt) would be over a longer acquisition period).
Indeed the 1E6 and 86400*1000 constants are correct for the data file as given; datenums are in integer and fractional days by the implementation.
>> dat(1:4,:)
ans =
246.0000 18.0000 27.0000 8.1520 160.0000
246.0000 18.0000 27.0000 8.1520 852.0000
246.0000 18.0000 27.0000 8.1520 984.0000
246.0000 18.0000 27.0000 8.1530 278.0000
>> num2str(dat(1:4,4)+dat(1:4,5)/1E6,'%.8f')
ans =
8.15216000
8.15285200
8.15298400
8.15327800
shows the precise seconds to the microsecond as given in the input data.
While as you note, datestr won't show more than the msec resolution, you can prove there's more resolution within the date numbers pretty simply--
>> dv=datevec(dn); % get the full date vectors; dv(:,6) holds the seconds
Default precision displayed is four digits, but
>> num2str(dv(1:4,6),'%.6f')
ans =
8.152199
8.152901
8.153000
8.153297
>>
Ewww... this isn't quite the same--let's see the rounding error magnitude expected...
>> eps(dn(1))/(1E-6/86400)
ans =
10.0583
>> ss=dat(:,4)+dat(:,5)/1E6; % seconds as given in the input file
>> abs(ss(1)-dv(1,6))*1E6
ans =
38.7915
>>
The first is the double-precision eps() for a date number at this particular time range divided by the magnitude of 1 usec in fractional days. This shows roughly 10 usec is the theoretical limit for a double precision variable. So, the resulting discrepancy of about 40 usec returned is the order of magnitude that is the limit of what the implementation can provide. For the sample data, the statistics are
>> er=abs((ss-dv(:,6)))*1E6; % difference between input/datenum, usec
>> [max(er) min(er) mean(er)] % range and average difference
ans =
53.8687 1.7163 25.2802
>>
But, given the requirement for checking within msec resolution, this is well within it and has the convenience of dealing with whatever rollover there may be in any particular time sequence, regardless of time of day or even year rollover if the data acq is a continuous process.
If you were to need the full microsecond resolution, then you would have to do the conversion of the time fields independently to get the full double precision within range.
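The eps() arithmetic above can be cross-checked outside MATLAB too. A rough Python equivalent, assuming a datenum magnitude of roughly 736,000 days for dates near late 2015 (the exact value barely matters, since only the power-of-two range affects the spacing):

```python
import math

datenum_scale = 736000.0          # approximate datenum for late 2015, in days
usec_in_days = 1e-6 / 86400.0     # one microsecond expressed in days

# Spacing of doubles at that magnitude, divided by a microsecond-in-days,
# gives the ~10 usec resolution floor quoted in the answer.
limit_usec = math.ulp(datenum_scale) / usec_in_days
```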


More Answers (0)

Asked: on 9 Dec 2015
Commented: dpb on 12 Dec 2015