Filter or delete data from (time)table depending of missing values

Hello dear community (from which I already read a lot - thank you).
I've got datasets (solar irradiation - GHI) for approx. 4 years with a 10mn timestep ; and it's full of holes. I need to get a hourly average of GHI and I want to exclude/delete any "hour" that hasn't 6 GHI values (for 0, 10, 20, 30, 40 and 50 minutes within eah hour, of each day, of each month, of each year).
I joined a sample of one dataset. I already know how to correct some flaws of the file, like to round some datetime value that are for exemple (HH:MM:SS) 10:09:59 instead of 10:10:10 (with dateshift 'start','minute','nearest').
I already tried "retime" by creating a timetable, filled with NaN values for missing GHI but this is not convenient as I also need access to the datetime vector (once filtered) for comparisons with other datasets. Is there anyway to avoid a loop (datasets sizes vary from 200k to 300k rows) ? If not, a hint would be appreciated.
Thank you very much

10 Comments

but this is not convenient as I also need access to the datetime vector
Wouldn't you just access the time vector from the timetable ? It should tell you what the times of the entries are.
Sorry I tried to be consised but I feel I shoud explain more where I am with the data. For now I got there :
% Loading Dataset
sample = readtable('sample.txt');
%% Processing Table Data
sample_datetime = dateshift(datetime(sample{:,1},'ConvertFrom','datenum'),'start','minute','nearest');
sample_GHI = sample{:,9};
Now, I don't know which direction I shoud take to filter the data as explaind (have I been clear enough in the first description). I created a TimeTable with the two vectors (datetime and GHI)
data = timetable(sample_datetime,sample_GHI);
... but I really don't see how can I exclude/delete every "incomplete" hour.
Maybe the timetable isn't even necessary ; maybe working with classic array would be easier ? maybe the function find can help ?
That's where I am right now, I'm a bit lost with options I don't have no control over.
Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));
(This was a bit more difficult to develop than might appear from the code.)
Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.
Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)
The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .
Another way to write the code without that hack would be something like
Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);
It's very elegant indeed. In fact, I need the mean of the 6 GHI values of hour 01:00 to 01:59 to account for hour 02:00 (our point of view is the average energy accumulated in 1 hour). I just added a " +1/24 " in this line to do the trick :
sample_datetime = dateshift(datetime(sample{:,1} +1/24 ,'ConvertFrom','datenum'),'start','minute','nearest');
In the function you generously proposed, I modified " sum " for " mean " and at first glance it seems the results are OK (at least for the 2 first complete hours I checked). The code has been modified this way :
Mnan = @(v)mean(v) + 0/(numel(v)==6);
datamean = rmmissing(retime(data, 'hourly', Mnan));
Now, to be honest I have trouble understanding the purpose of 0/(numel(v)==6) despite your kind explanations and that I understand what does numel.
What I understand is :
1st line : Mnan is a function and v its variable
2nd line : v would be any value of GHI from " data ", grouped 'hourly', on which the function applies
Is it that 0/numel(v), when v 6, gives a " 0/0 ", hence a NaN ?
So it automatically exlcudes any hour from " datamean " when there wasn't 6 values of GHI ?
Thanks a lot for your time. Please let me know how I can validate your answer as it's my first time asking something here.
Oh I almost forgot, I tried to extract the datetime vector from the TimeTable with : datamean.Time ; any hint of how I shoud do (I feel more confortable working with vectors with my little experience) ?
cIs it that 0/numel(v), when v ≠ 6, gives a " 0/0 ", hence a NaN ?
When v has 6 elements then numel(v) == 6 is true, true has numeric value 1, and 0/1 is 0, so in that case you add 0, giving you the mean untouched.
When v doesnot have 6 elements then numel(v)==6 is false, false has numeric value 0, and 0/0 is nan, so in that case you add nan, which poisons any finite value getting nan for the total no matter what the numeric value was.
So now you have values that are either numeric valid values or nan for invalid totals (missing elements). And then you apply rmmissing to the resulting timetable, which throws away the entries with nan.
This also means that you throw away entries that we not missing but had nan explicitly for the value, so if you happen to have data that marks missing with nan you do not need to filter it first.
data = timetable(sample_datetime,sample_GHI);
If that is what you used to build the timetable then datamean.sampletime to get the times.
EDIT : That's OK, sorry I needed time and trials to understand. I used " datamean.sample_datetime " and I got the vector I need.
Grateful thanks for your help
In fact I need the filtered hourly timestamps, which I have to extract from " datamean " as in original " data ", the timestamps are not " hourly " and also contains unwanted (because uncomplete) hours.
I have about 6 experimental datasets to compare with 2 hourly GHI simulations datasets ; for the easier datasets (2 columns vectors : datetime / GHI) I already compared, I used this trick :
I = ismember(dataset_experimental_datetime,dataset_simulation_datetime); % index of corresponding datetime between experimental and simulation datasets
dataset_experimental_datetime = dataset_experimental_datetime(I==1,:); % filtering experrimental datetime vector
dataset_experimental_GHI = dataset_experimental_GHI(I==1,:); clear I % filtering eperimental GHI vector according to datetime
I then get my to two " N rows x 2columns " experimental and simulation datasets which are the same size and every experimental GHI at a given hour can be compared to the simuation GHI at the same given hour.
I would like to use the same method, thus I need to access the timestamps vector of the filtered timetable.
I'm not used to TimeTable, I tried it for first time, by despair, yesterday before you helped me, without knowing if it would be OK.
Again, thanks a lot for time and explanations.
Note: in most cases, you should avoid clearing variables unless it is necessary to test their existence for the flow of control you have established: clearing variables is documented as being lower performance than leaving them. If you have a large variable and you need to release the state, then Mathworks suggests assigning empty of the correct datatype to the variable instead of clearing the variable.
Thanks for the advice. I indeed thought I would gain some speed not keeping tables and array I don't need anymore.
Consider for example
bar = [1 2 3];
bar(2)
clear bar
bar(2)
if rand() < 0.5; bar = [4 5 6]; end
bar(2)
The first bar(2), bar is a variable and indexing is to be used.
The second bar(2), bar has been cleared. But look -- bar() is the name of a function, so the second bar(2) invokes the graphics function!
The third bar(2)... bar might have been assigned a value, or it might not have been, so the third bar(2) might be indexing or it might be a function call...
If there had been no clear of bar, then it would not be necessary to re-resolve the name and no need to have to insert the hooks to deal with the possibility that you are dealing with function instead of variable.
In the case where a variable is never going to be referred to again in the same function, then it is not clear whether there is a performance penalty for clearing the variable as opposed to assigning empty to it, but Mathworks has specifically guided to avoid clearing variables inside a loop.

Sign in to comment.

 Accepted Answer

[Copying answer down from comments]
Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));
(This was a bit more difficult to develop than might appear from the code.)
Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.
Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)
The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .
Another way to write the code without that hack would be something like
Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);

More Answers (0)

Categories

Find more on Interactive Control and Callbacks in Help Center and File Exchange

Products

Release

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!