Identifying gaps in time-series data

Hi there!
I'm fairly new and inexperienced at Matlab so please forgive me if my question is either really easy to solve or if my question is hard to understand. I will try to be as clear as possible!
I am currently trying to write a code which will detect gaps in time-series data. These gaps are not recorded as NaN or 0 values in the raw data, rather they are simply skipped over and no gap is recorded.
Ex: 20110222T180326.761Z 3.216852
20110222T180327.011Z 3.216852
20110222T180844.239Z 3.216751
20110222T180844.489Z 3.216751
Here you can see there is a gap between 18:03:27 and 18:08:44. I want to be able to write a code that can detect these gaps quickly so that I don't have to manually find them.
The process of detecting these gaps is also slightly more complicated because I am not using a full version of Matlab. Rather, the project I am working for (Neptune Canada) provides users the ability to write and run code from our website but this doesn't allow access to many of Matlabs features besides the command window. As a result, I am having trouble manipulating the data because of the specific way the data must be accessed.
Release: R2009a (Coming Soon R2011a) Toolboxes: Database, Statistics
The code that I have at the moment is as follows:
% Get current time for a reference of processing time
tic
% Device and Sensor IDs:
deviceid=22663;
neptune(1).sensorID=7253; % conductivity
neptune(2).sensorID=7254; % pressure
neptune(3).sensorID=7255; % salinity
% Enter start date and end date for desired time series using Oracle timestamp data type. For example:
startdate='22-Feb-2011 06:00:00 PM';
enddate='22-Feb-2011 07:00:00 PM';
% RETRIEVE TIME SERIES AND BASIC METADATA FOR EACH SENSOR
[neptune,meta,nd]=getData(neptune,startdate,enddate,deviceid);
% neptune is a structure array containing the data for each sensor
% meta is a structure containing basic metadata
% nd is a boolean which will be 1 if there is time-series data, and 0 otherwise.
I didn't write this code myself, it was provided as a template to the user to access data from the database for a specific instrument. For this specific instrument I know that the sampling period is 0.250 seconds so I wrote a code to determine the difference between timestamps as:
t=datenum(neptune(1).time,30);
dt=diff(t)*86400;
I thought I could somehow find all the instances where dt>0.500 and insert NaN but I am not sure how to do that and I'm also not sure that would even solve my problem since I want to know the specific time where the gap occurs so that when I plot the data it appears as a gap not as line connecting the two points.
I know this is a long-winded, wordy question but if anyone has any suggestions I could try or resources I could refer to I would really appreciate it. I am a Co-op student and I have used Matlab before but not with large data sets and especially not with data sets that are missing information! Please let me know what you think or if I need to elaborate further.
Cheers.

1 Comment

You explained your problem well although many of the info are irrelevant. Please provide an example data of "t" and explain what outcome do you expect. I am sure a solution could be found.

Sign in to comment.

 Accepted Answer

EDIT per Fangjun's suggestion
% Suppose input is (two gaps over the threshold)
neptune.time = datenum({'20110222T180326.761'
'20110222T180327.011'
'20110222T180844.239'
'20110222T180844.444'
'20110222T180844.665'
'20110222T181874.239'}, 'yyyymmddTHHMMSS.FFF');
% Threshold, all gaps, idx of bad gaps
threshold = 1/86400*0.25;
gap = diff(neptune.time);
idx = find(gap > threshold);
% Gaps that exceed threshold
gap(idx)
% Dates
strcat(datestr(neptune.time(idx),'yyyy/mm/dd HH:MM:SS'),...
' -',...
datestr(neptune.time(idx+1),' yyyy/mm/dd HH:MM:SS'))
Note: the threshold for 0.25 secs sampling is:
1/86400*0.25

13 Comments

That is a very smart use of datanum(). This is the first time I learned that it can be used this way.
The problem is that I don't have just a simple matrix to give me the time stamps. When I input
[neptune,meta,nd]=getData(neptune,startdate,enddate,deviceid)
it returns with
neptune =
1x4 struct array with fields:
sensorID
units
sensorType
name
code
dat
time
so I tried using the idx code you suggested by writing:
t=datenum(neptune(1).time,30); %this is the only way I have been able to access the timestamps from the neptune array
idx = find(diff(t)>(1/86400*0.5))+1 %0.5 to detect only gaps larger than 2 sampling periods
and this returns:
idx = 830
8949
So now I'm just not sure what these output numbers mean or how/if they are telling me what I want to know. Thanks so much for the help already!
{neptune(1).time(idx-1), neptune(1).time(idx)}
type neptune.time in command window, can you show the output?
If I enter neptune(1).time I get an error message saying "Error: Not allowed to execute a file outside codes"
The code I have written now is this:
startgap=datestr(neptune(k).time(idx-1),'yyyymmddTHHMMSS')
endgap=datestr(neptune(k).time(idx),'yyyymmddTHHMMSS')
tgap={startgap, endgap}
cell2mat(tgap)
The only problem is that it displays the data as:
tgap =
[2x15 char] [2x15 char]
ans =
20110222T18032720110222T180844
20110222T18423320110222T184234
Which is almost exactly what I want except that I need the start and end times stamps to be separated so they can be easily read across from left to right.
You have been so helpful!! Thank you so much.
I should also mention that before converting the times to 'yyymmddTHHMMSS' the result from {neptune(k).time(idx-1),neptune(k).time(idx)} showed
ans =
[2x1 double] [2x1 double]
Something else is wrong. Your previous comment say neptune is 1x4 struct array. Now you say neptune(1).time has error.
Type neptune.time, it is suppose to show 4 time-value at once.
Typing neptune.time gives me an extremely long list of values
ans =
1.0e+05 *
7.3456
7.3456
7.3456
7.3456
7.3456
7.3456
7.3456... (all values are 7.3456)
That's good. Assume a=neptune.time, then you can use
idx=find(diff(a)>threshold)+1 to get the index where there is a gap of time that is greater than the threshold.
Thanks. I've managed to work out the code so that it finds the gaps but I want them listed in two columns where one is the start time of the gaps (idx-1) and the second column is the end time of the gaps (idx). Any suggestions? I would also rather have the time in yyyymmddTHHMMSS.FFF format rather than the vector form so it is easier to read the exact date and time that the gaps are occurring.
idx = find(diff(neptune.time) > threshold)+1;
strcat(datestr(neptune.time(idx-1),'yyyy/mm/dd HH:MM:SS'),' -', datestr(neptune.time(idx),' yyyy/mm/dd HH:MM:SS'))
You can use whatever format you like.
@Oleg, I like this better. Why don't you include this with the expanded example data to make it a complete answer? It shows a good example illustrating date manipulation.
Using strcat(datestr(neptune.time(idx-1),'yyyy/mm/dd HH:MM:SS'),' -', datestr(neptune.time(idx),' yyyy/mm/dd HH:MM:SS')) worked!!
THANK YOU! For some reason the technique below wouldn't work. It had a problem with using datenum(neptune.time,'yyy...') because the "input was not an array of strings" (???). Is this because neptune.time returns the time data in 1.0e+05 * 7.3456 format (vector format if I'm understanding correctly)?
Anyways, I really appreciate all the help the two of you have given me. I have another question about replacing these "gaps" with NaN so that the user can view the entire timeseries with gaps marked by NaN. Should I post a new question to the Answers forum?
Cheers
Brianne

Sign in to comment.

More Answers (1)

Okay, I am not trying to steal the answer from Oleg. Here is an example using the approach he suggested.
In = {'20110222T180326.761'
'20110222T180327.011'
'20110222T180844.239'
'20110222T180854.239'
'20110222T180964.239'
'20110222T181874.239'};
a=datenum(In, 'yyyymmddTHHMMSS.FFF');
idx=find(diff(a)>0.003);
left=In(idx);
right=In(idx+1);
result=[left right]';
output=sprintf('%s -> %s\n',result{:})

4 Comments

That method works if I leave the time values as 1.0e+05 * 7.3456 but when I change them to 'yyyymmddTHHMMSS.FFF' and attempt to put them in a matrix like [left right] there is no space between the two columns. Ex: 20110222T180327.01120110222T180844.239 when I want 20110222T180327.011 20110222T180844.239
Is this not possible because of the date format I require?
result = [left ' ' right]
Provided that
left = In{idx};
right = In{idx+1};
(note the {} rather than ())
With the code as is, that is left=In(idx) and right=In(idx+1) then you are working with cell arrays and [left right] would output as a pair of quoted strings or as something like
[26 x char] [26 x char]
@Walter, I though that (left = In{idx}; right = In{idx+1}; result = [left ' ' right]) would work too. But when I tried it, it only gives one line.
@Brianne Zimmerman, to get your desired format shouldn't be hard. See edited answer above.

Sign in to comment.

Categories

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!