Cut timetable data following certain conditions

I have created a timetable with one-minute timesteps. I would like to take this timetable and break it into chunks (most likely stored in a cell array) following the conditions listed below. If there is a better way to get this data, I'd love to hear suggestions. Thanks!
Criteria:
-to keep data
  1. At least 5 min of continuous data (positive, non-zero values).
  2. A maximum of 5 minutes of zero values between times when data is present.
-to break up data into chunks
  1. No data for at least 60 minutes.
So, if there are five or more consecutive records with data, that run should be kept until the continuous records stop, allowing gaps of fewer than 6 zero-value records within it. Then, after finding all these groups of data, I'd like to separate the groups that have 60 minutes or more of no data between them.
Here's some sample data (timetable format):
'06-Feb-2019 03:40:00' 0 0
'06-Feb-2019 03:41:00' 0 0
'06-Feb-2019 03:42:00' 0 0
'06-Feb-2019 03:43:00' 0 0
'06-Feb-2019 03:44:00' 0 0
'06-Feb-2019 03:45:00' 0 0
'06-Feb-2019 03:46:00' 0 0
'06-Feb-2019 03:47:00' 0 0
'06-Feb-2019 03:48:00' 0 0
'06-Feb-2019 03:49:00' 0 0
'06-Feb-2019 03:50:00' 0.000900000000000000 0.00266666666666667
'06-Feb-2019 03:51:00' 0.00218333333333333 0.0195000000000000
'06-Feb-2019 03:52:00' 0 0
'06-Feb-2019 03:53:00' 0.00213333333333333 0.0118333333333333
'06-Feb-2019 03:54:00' 0.00286666666666667 0.0161666666666667
'06-Feb-2019 03:55:00' 0.00311666666666667 0.0213333333333333
'06-Feb-2019 03:56:00' 0.00161666666666667 0.00833333333333333
'06-Feb-2019 03:57:00' 0.00313333333333333 0.0173333333333333
'06-Feb-2019 03:58:00' 0.000633333333333333 0.00200000000000000
'06-Feb-2019 03:59:00' 0.00451666666666667 0.0521666666666667
'06-Feb-2019 04:00:00' 0 0
'06-Feb-2019 04:01:00' 0.000616666666666667 0.00183333333333333
'06-Feb-2019 04:02:00' 0 0
'06-Feb-2019 04:03:00' 0.000466666666666667 0.00133333333333333
'06-Feb-2019 04:04:00' 0.000500000000000000 0.00116666666666667
'06-Feb-2019 04:05:00' 0.00281666666666667 0.0141666666666667
'06-Feb-2019 04:06:00' 0.00376666666666667 0.0195000000000000
'06-Feb-2019 04:07:00' 0.00371666666666667 0.0150000000000000
'06-Feb-2019 04:08:00' 0.00235000000000000 0.00800000000000000
'06-Feb-2019 04:09:00' 0.00370000000000000 0.0151666666666667
'06-Feb-2019 04:10:00' 0.00341666666666667 0.0245000000000000
'06-Feb-2019 04:11:00' 0.00320000000000000 0.0170000000000000
'06-Feb-2019 04:12:00' 0.00410000000000000 0.0181666666666667
'06-Feb-2019 04:13:00' 0.00535000000000000 0.0396666666666667
'06-Feb-2019 04:14:00' 0.0116666666666667 0.0598333333333333
'06-Feb-2019 04:15:00' 0.00908333333333333 0.0375000000000000
'06-Feb-2019 04:16:00' 0.00708333333333333 0.0265000000000000
'06-Feb-2019 04:17:00' 0.00408333333333333 0.0126666666666667
'06-Feb-2019 04:18:00' 0.00598333333333333 0.0451666666666667
'06-Feb-2019 04:19:00' 0.0410166666666667 0.397333333333333
'06-Feb-2019 04:20:00' 0.0596166666666667 0.680333333333333
'06-Feb-2019 04:21:00' 0.0482500000000000 0.554166666666667
'06-Feb-2019 04:22:00' 0.0347333333333333 0.447500000000000
'06-Feb-2019 04:23:00' 0.00551666666666667 0.0335000000000000
'06-Feb-2019 04:24:00' 0.00425000000000000 0.0196666666666667
'06-Feb-2019 04:25:00' 0.00553333333333333 0.0450000000000000
'06-Feb-2019 04:26:00' 0.00823333333333333 0.0691666666666667
'06-Feb-2019 04:27:00' 0.00661666666666667 0.0560000000000000
'06-Feb-2019 04:28:00' 0.00968333333333333 0.0676666666666667
'06-Feb-2019 04:29:00' 0.00733333333333333 0.0455000000000000
'06-Feb-2019 04:30:00' 0.00471666666666667 0.0340000000000000
'06-Feb-2019 04:31:00' 0.00295000000000000 0.0191666666666667
'06-Feb-2019 04:32:00' 0.00441666666666667 0.0326666666666667
'06-Feb-2019 04:33:00' 0.00618333333333333 0.0575000000000000
'06-Feb-2019 04:34:00' 0.00696666666666667 0.0793333333333333
'06-Feb-2019 04:35:00' 0 0
'06-Feb-2019 04:36:00' 0 0
'06-Feb-2019 04:37:00' 0.000966666666666667 0.00616666666666667
'06-Feb-2019 04:38:00' 0 0
'06-Feb-2019 04:39:00' 0 0
'06-Feb-2019 04:40:00' 0 0
The data above should all be grouped together because the breaks from 3:50 to 4:37 are less than 5 minutes. Let's assume that before the first timestep and after the ending timestep there were 60 consecutive zero values. I should mention that in all cases, if one variable is 0, the second one will be as well.

8 Comments

Unfortunately there's no easy way to copy-paste those values into a timetable that we can work with. Could you attach a mat file containing the timetable and any other related variables?
Off the bat I'll recommend not breaking the table apart. Instead, define a variable that merely indexes which rows should be selected and then, whenever you need those rows, use TT(rows,:), where TT is your timetable and rows is a logical index selecting the rows to analyze.
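A minimal sketch of that indexing pattern (TT here is any timetable with numeric variables):

```matlab
% Minimal sketch of logical row-indexing on a timetable.
rows = any(TT{:,:} ~= 0, 2);  % flag rows where any variable is non-zero
subset = TT(rows, :);         % parentheses preserve the timetable class
```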
I tried to upload the file. Doesn't seem to want to attach. Any ideas on this?
Can you expand on the alternative method? How can I define a variable that is temporal and apply it to a timetable? Let's start with just indexing groups of data where a minimum of 5 minutes (records) is kept, with any adjacent rows of data (before or after) included if separated by 5 or fewer rows of no data.
Or, maybe this is easier to tackle by removing rows with 60 or more consecutive no-data records?
"I tried to upload the file. Doesn't seem to want to attach. Any ideas on this?"
Were you trying to attach it using the paperclip icon? Was the file too large? If so, we really only need enough rows to cover the pattern you described.
"Can you expand on the alternative method?"
Whether you use that method or not, you'll need to create a variable that identifies the rows you want to select. My proposal is to use that variable to isolate the rows as you're working with the table rather than pulling the selected rows out of the table.
"Let's start with just indexing groups of data where a min of 5 minutes (records) are kept, with any rows of data adjacent (before or after) included if separated by 5 or less rows of no-data."
I think I understand this but an example would really make it solid. Could you provide a short example of a datetime vector (or whatever format your data are in) and which rows should be selected from that vector? Attaching the data would be even better but I'd still like to see an expected output for at least a section of the table.
Note that the data in your question all differ by 1 minute so we cannot see the pattern you're describing.
The above data is a small sample of the total data in the timetable series, which runs through the current year. In the case above, I want the sum of all values in each variable column for these rows with data. This is because there is a section of more than 5 consecutive records (4:03 to 4:34). After finding that grouping, the method should look +5 and -5 timesteps. If there is no data on either side, the data is cut at that location. If data is present in the variable columns within fewer than 6 records on either side (in this case, timestep 4:37 is separated by only 2 rows of no data), then this data is considered part of the original chunk and kept.
In this example, the data shown would be cut at 3:49 and 4:38 because there is more than 5 minutes of no data.
This method would slim down the total record to include new groupings of time periods with 5 consecutive records and neighboring records if separated by less than 6 minutes of no data.
Then, these new groups would again be parsed to only keep groups with more than 60 minutes of no records between them. For example, if this data continued for 40 minutes with 0-values and then had a record, we would not keep this data. If there were 61 minutes of 0-values from the end of this set to another record of positive values, then we would keep the group above.
Each group will then have a summation performed on each variable in order to plot. In the end, I'd like to know how many of these groups exist, determine the sum of each variable for each group, and plot each as a single point. The number of points in the resulting figure equals the number of unique groups.
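A minimal sketch of this final step, assuming a Group column like the one built in the accepted answer below (0 marking ungrouped rows):

```matlab
% Sum each variable per group and plot one point per group.
% Assumes TT has Var1, Var2, and an integer Group column (0 = ungrouped).
keep = TT.Group > 0;
grp = findgroups(TT.Group(keep));              % renumber groups 1..n
sumVar1 = splitapply(@sum, TT.Var1(keep), grp);
sumVar2 = splitapply(@sum, TT.Var2(keep), grp);
nGroups = numel(sumVar1);                      % number of unique groups
scatter(sumVar1, sumVar2, 'filled')            % one marker per group
```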
@Eric Escoto, have you had a chance to try out the answer?
@Adam Danz, Hi, not yet. I'll try it out soon. Finals week! Had to push this back a few days. Soon.
No worries, good luck with finals!


 Accepted Answer

Well, that took a little more time than I anticipated!
Here's a demo with inline comments explaining the details.
I added a few lines to the demo data from your question so that some data at the end would be excluded. Here I just convert the data into a timetable.
% Create demo data (Column vector)
t = {'06-Feb-2019 03:40:00' 0 0
'06-Feb-2019 03:41:00' 0 0
'06-Feb-2019 03:42:00' 0 0
'06-Feb-2019 03:43:00' 0 0
'06-Feb-2019 03:44:00' 0 0
'06-Feb-2019 03:45:00' 0 0
'06-Feb-2019 03:46:00' 0 0
'06-Feb-2019 03:47:00' 0 0
'06-Feb-2019 03:48:00' 0 0
'06-Feb-2019 03:49:00' 0 0
'06-Feb-2019 03:50:00' 0.000900000000000000 0.00266666666666667
'06-Feb-2019 03:51:00' 0.00218333333333333 0.0195000000000000
'06-Feb-2019 03:52:00' 0 0
'06-Feb-2019 03:53:00' 0.00213333333333333 0.0118333333333333
'06-Feb-2019 03:54:00' 0.00286666666666667 0.0161666666666667
'06-Feb-2019 03:55:00' 0.00311666666666667 0.0213333333333333
'06-Feb-2019 03:56:00' 0.00161666666666667 0.00833333333333333
'06-Feb-2019 03:57:00' 0.00313333333333333 0.0173333333333333
'06-Feb-2019 03:58:00' 0.000633333333333333 0.00200000000000000
'06-Feb-2019 03:59:00' 0.00451666666666667 0.0521666666666667
'06-Feb-2019 04:00:00' 0 0
'06-Feb-2019 04:01:00' 0.000616666666666667 0.00183333333333333
'06-Feb-2019 04:02:00' 0 0
'06-Feb-2019 04:03:00' 0.000466666666666667 0.00133333333333333
'06-Feb-2019 04:04:00' 0.000500000000000000 0.00116666666666667
'06-Feb-2019 04:05:00' 0.00281666666666667 0.0141666666666667
'06-Feb-2019 04:06:00' 0.00376666666666667 0.0195000000000000
'06-Feb-2019 04:07:00' 0.00371666666666667 0.0150000000000000
'06-Feb-2019 04:08:00' 0.00235000000000000 0.00800000000000000
'06-Feb-2019 04:09:00' 0.00370000000000000 0.0151666666666667
'06-Feb-2019 04:10:00' 0.00341666666666667 0.0245000000000000
'06-Feb-2019 04:11:00' 0.00320000000000000 0.0170000000000000
'06-Feb-2019 04:12:00' 0.00410000000000000 0.0181666666666667
'06-Feb-2019 04:13:00' 0.00535000000000000 0.0396666666666667
'06-Feb-2019 04:14:00' 0.0116666666666667 0.0598333333333333
'06-Feb-2019 04:15:00' 0.00908333333333333 0.0375000000000000
'06-Feb-2019 04:16:00' 0.00708333333333333 0.0265000000000000
'06-Feb-2019 04:17:00' 0.00408333333333333 0.0126666666666667
'06-Feb-2019 04:18:00' 0.00598333333333333 0.0451666666666667
'06-Feb-2019 04:19:00' 0.0410166666666667 0.397333333333333
'06-Feb-2019 04:20:00' 0.0596166666666667 0.680333333333333
'06-Feb-2019 04:21:00' 0.0482500000000000 0.554166666666667
'06-Feb-2019 04:22:00' 0.0347333333333333 0.447500000000000
'06-Feb-2019 04:23:00' 0.00551666666666667 0.0335000000000000
'06-Feb-2019 04:24:00' 0.00425000000000000 0.0196666666666667
'06-Feb-2019 04:25:00' 0.00553333333333333 0.0450000000000000
'06-Feb-2019 04:26:00' 0.00823333333333333 0.0691666666666667
'06-Feb-2019 04:27:00' 0.00661666666666667 0.0560000000000000
'06-Feb-2019 04:28:00' 0.00968333333333333 0.0676666666666667
'06-Feb-2019 04:29:00' 0.00733333333333333 0.0455000000000000
'06-Feb-2019 04:30:00' 0.00471666666666667 0.0340000000000000
'06-Feb-2019 04:31:00' 0.00295000000000000 0.0191666666666667
'06-Feb-2019 04:32:00' 0.00441666666666667 0.0326666666666667
'06-Feb-2019 04:33:00' 0.00618333333333333 0.0575000000000000
'06-Feb-2019 04:34:00' 0.00696666666666667 0.0793333333333333
'06-Feb-2019 04:35:00' 0 0
'06-Feb-2019 04:36:00' 0 0
'06-Feb-2019 04:37:00' 0.000966666666666667 0.00616666666666667
'06-Feb-2019 04:38:00' 0 0
'06-Feb-2019 04:39:00' 0 0
'06-Feb-2019 04:40:00' 0 0
'06-Feb-2019 04:41:00' 0 0
'06-Feb-2019 04:42:00' 0 0
'06-Feb-2019 04:43:00' 0 0
'06-Feb-2019 04:44:00' 0 0
'06-Feb-2019 04:45:00' 0 0
'06-Feb-2019 04:46:00' 0.000966666666666667 0.00616666666666667
'06-Feb-2019 04:47:00' 0.000966666666666667 0.00616666666666667
'06-Feb-2019 04:48:00' 0 0};
dates = datetime(t(:,1),'InputFormat','dd-MMM-yyyy HH:mm:ss'); % datetime vector
TT = array2timetable(cell2mat(t(:,2:end)),'RowTimes', dates); % timetable
From here forward, the only input is TT, the timetable produced above. This adds two new columns to the timetable:
  • Safe, a logical vector identifying the groups of consecutive data +/- 5 rows.
  • Group, which identifies the group number of each consecutive block (this example data has only 1 block).
% Identify rows with missing data in any column. Missing data
% are considered to be values of 0.
missingIdx = any(TT{:,:}==0,2);
% Identify groups of consecutive data
consecGroups = bwlabel(~missingIdx); %req image processing toolbox
consecGroups(consecGroups==0) = nan;
groupDuration = splitapply(@range, TT.Properties.RowTimes, consecGroups);
% mark grouped data that spans time greater than minTime as safe
minTime = minutes(5);
safe = false(size(consecGroups));
safe(ismember(consecGroups,find(groupDuration > minTime))) = true;
% Locate not-safe data that are within maxRows number of rows from safe data
maxRows = 5;
notSafeRows = find(~safe);
select = min(abs(find(safe)-notSafeRows.')) < maxRows; % implicit expansion requires >=r2016b; 'safe' is col vec
safe(notSafeRows(select)) = true;
% Assign safe vector to timetable and group number
TT.Safe = safe;
TT.Group = bwlabel(safe); %requires image proc toolbox
Now if you want to perform statistics on a certain group,
sum(TT.Var1(TT.Group==1))

21 Comments

This is nice.
Notice that Adam has not split up the timetable. Instead, he's created a new variable that identifies the different blocks of "consecutive" data, and then used that "grouping variable" to compute something on one group.
To take this to its logical conclusion:
blockSums = splitapply(@sum,TT.Var1,TT.Group)
computes the sum for all the blocks. (Can't run that right now to make sure it works, but I'm pretty sure it does).
It should also be pretty straightforward to create a new timetable with one row for each block, containing the (starting?) timestamp, the sum, and whatever other summary stats are needed. rowfun should be able to do that.
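A hedged sketch of that per-group summary (variable names Var1, Var2, and Group assumed from the answer above; varfun is used here rather than rowfun):

```matlab
% Hedged sketch: one summary row per group using varfun.
% Assumes TT has numeric Var1, Var2 and an integer Group column (0 = ungrouped).
T = timetable2table(TT(TT.Group > 0, :));    % varfun grouping works on tables
groupSummary = varfun(@sum, T, ...
    'InputVariables', {'Var1','Var2'}, ...
    'GroupingVariables', 'Group');           % Group, GroupCount, sum_Var1, sum_Var2
```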
Thanks, Peter.
grpstats() is another option to explore.
Something like
statarray = grpstats(TT,'Group','sum') % not tested
Thanks Adam and Peter,
Looks like almost everything works. There's one issue with grouping. The period between groups looks to be limited by the 'minTime' in this solution, which is related to the 'safe' values. So, groups are split more frequently than they should be.
I suppose it's more difficult with just this sample data we have here. I'm looking at this now; any suggestions?
"...groups are split more frequently than they should be"
I'm not sure I follow. Could you provide a small input example where the groups are split but should not be split?
Attached is a sample where two groups are created, but only one should exist because fewer than 60 minutes fall between these groups.
I just re-read your earlier description under your question (here).
The description states,
"After finding that grouping the method should look + 5 and - 5 timesteps. If there is no-data on either side, the data is cut at that location."
That tells me that at most there will be 5 consecutive minutes of empty data within each group. But then the description states,
"Then, these new groups would again be parsed to only keep groups with greater than 60 minutes of no records."
1) But how could there be 60 minutes of no records within a group when we just defined groups as having no more than 5 consecutive minutes of empty data?
2) Also, the entire timetable you shared spans only 60 minutes, so how could it serve as an example with 60 minutes of no records?
Slightly confusing, I agree.
There are a total of 3 grouping criteria, although 1, 2 fit together.
Initially, we look at each record individually, then +4 and -4 records from there. This first step creates a minimum row count of 5. The head/tail of this minimum row count (or second step, though for simplicity it can stay part of the first step) continues while there are non-zero values, and the group terminates when the row before or after the group hits a zero value (a single zero value is enough to terminate the grouping). The range of values for this first step is rows >= 5.
The next criterion continues up/down through the zero-value rows that sandwich the group. We should count these rows with zero values. If there are 60 or more zero rows before/after the rows with non-zero values (minimum of 5), then all rows with non-zero values are considered the final group.
In the data I sent you, this is an example of data that should be one group. For simplicity, assume that before and after this dataset, there are 61 zero-value rows. We can add these in if it helps with the parsing.
Perhaps the first step should be to cut at locations where >60 rows of zero values are counted? I'm working on this now.
I'll need to see an example that has more than 1 group. If you can create an example that follows these rules and demonstrates each rule, I might be able to see the pattern. Instead of using 60, use something like 10 to reduce the work.
Here's an expanded dataset.
Is the Safe column the results of our current algorithm or your desired results?
If it's not the latter, I need to see the desired results of the Safe column (and Group column).
Here's the modified dataset. There are two final groups.
Note I also took out the +/- 4 rows in the safe column which don't have non-zero values.
In group 113, the zero values could also be included. This would provide a total duration of one group if all zeros in a group were saved.
@Eric Escoto, just a heads-up: my time is thin for the next ~24 hr, but I will follow up on this soonish. (I'll replace this comment with my feedback.)
No worries! Thanks for the help so far, I'll keep chugging away.
Here's a thought: using the existing algorithm, I'm looking for a way to set up criteria for a new column that says, "if there are >=60 zero values between entries in the Group column, then split and assign another group number in a new column."
Hmm. Still can't seem to easily do this. I've started to just use the group number column to manually cut and paste each group into a new spreadsheet. But I'm noticing that the first and last four cells in each group are included even though they have 0-values in them. Any ideas for a better solution?
Hi Eric, I'm trying to recall this problem from 1/2 a month ago.
In the TTtest_2_desired.mat dataset, the problem was that the last two groups should be one group. They are marked as 2 groups because there is a span of 8 minutes between them. The maxRows variable allows for a total of 5 minutes of empty values (0) between groups in order to combine those groups. If I increase maxRows to 9, those last two groups are combined. But it's unclear whether this is what you want to do or whether a rule needs to change.
For all cases we want to keep that maxRows variable at 5 min, so that looks to be OK. The issue is the preceding and following +4 rows from this action. They include 0-values in the Var1 and Var2 columns. Perhaps a criterion could be set so that if the Group column equals 0...inf and Var1 and Var2 equal 0, we remove those rows?
The idea is to leave only a continuous data set with increasing group numbers that have values in the Var1 and Var2 columns.
It looks like this almost works for me. I've used the line below to remove rows where all four variables (Var1, Var2, Safe, Group) equal zero, but this produced a double array. Is there a way to write this so that I can keep the timetable format?
% Remove rows with all zeros to leave only the unique groups of rain.
TT2_events = TT2_P2_sub{:,:}(any(TT2_P2_sub{:,:},2),:);
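One way to keep the timetable class (a sketch, assuming TT2_P2_sub is the timetable above) is to index with parentheses instead of extracting with braces:

```matlab
% Braces {:,:} extract a plain double array; parentheses () preserve the timetable.
keep = any(TT2_P2_sub{:,:} ~= 0, 2);   % rows with any non-zero value
TT2_events = TT2_P2_sub(keep, :);      % result is still a timetable
```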
This line below will remove all rows where the group is 0.
TT(TT.Group==0,:) = [];
Am I correct in understanding that you still want to remove the leading and trailing 4 rows from each group where the Var1 and Var2 values are 0?
Hi Adam,
Yes, that's correct. I still want to be able to preserve the rows inside the groups that are zeros in Var1 and Var2, but the leading and trailing four rows of each group should be removed.
Ok.... I think this should do the trick.
Replace this section
% Locate not-safe data that are within maxRows number of rows from safe data
maxRows = 5;
notSafeRows = find(~safe);
select = min(abs(find(safe)-notSafeRows.')) < maxRows; % implicit expansion requires >=r2016b; 'safe' is col vec
safe(notSafeRows(select)) = true;
with this section
% Sliding window method
maxRows = 5;
for i = 1:numel(safe)-maxRows+1   % +1 so the final window is checked
    tempWindow = safe(i:i+maxRows-1);
    if tempWindow(1) && tempWindow(end)
        safe(i:i+maxRows-1) = true;
    end
end
The sliding window is a window whose length is defined by maxRows. It starts at the top of the safe vector and slides through the vector, marking everything within the window as TRUE whenever the first and last elements of the window are TRUE.
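A toy run of that window (length 5, so gaps of up to 3 false values between TRUE endpoints get filled):

```matlab
% Toy illustration of the sliding-window fill with window length 5.
safe = logical([1 1 0 0 1 0 0 0 0 0 1 1])';
maxRows = 5;
for i = 1:numel(safe)-maxRows+1
    tempWindow = safe(i:i+maxRows-1);
    if tempWindow(1) && tempWindow(end)
        safe(i:i+maxRows-1) = true;
    end
end
% The 2-row gap (rows 3:4) is filled; the 5-row gap (rows 6:10) is not.
```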
I've been playing around with these lines and still don't get what I'm after. I've attached a better example of data to illustrate the desired outcome.
In the attached data, we have a timetable with 4420 rows. Using the first set of code, it parses out into three groups, as can be seen in the fourth column. In row 4413 there is data in columns 1 and 2, but it's more than 60 rows away from the previous data and is not sandwiched by four other consecutive rows, so it will be ignored as a group.
The first three groups should be only one because there is less than 60 rows of no data between groups.
Data locations.
Group 1 in rows 535:550
Group 2 in rows 555:654
Group 3 rows in 692:741
Single record at row 4413
In this case, the safe vector would also be TRUE for rows where data is sandwiched by fewer than 60 rows of no data. For the attached dataset, the outcome should be one grouping from rows 538:740.
It may wind up that the first four rows of each group would be included but I'm thinking that I could then just manually remove them from each group.

