Filter or delete data from (time)table depending of missing values

Question

0 votes

sample.txt

Hello dear community (from which I already read a lot - thank you).

I've got datasets (solar irradiation - GHI) for approx. 4 years with a 10mn timestep ; and it's full of holes. I need to get a hourly average of GHI and I want to exclude/delete any "hour" that hasn't 6 GHI values (for 0, 10, 20, 30, 40 and 50 minutes within eah hour, of each day, of each month, of each year).

I joined a sample of one dataset. I already know how to correct some flaws of the file, like to round some datetime value that are for exemple (HH:MM:SS) 10:09:59 instead of 10:10:10 (with dateshift 'start','minute','nearest').

I already tried "retime" by creating a timetable, filled with NaN values for missing GHI but this is not convenient as I also need access to the datetime vector (once filtered) for comparisons with other datasets. Is there anyway to avoid a loop (datasets sizes vary from 200k to 300k rows) ? If not, a hint would be appreciated.

Thank you very much

10 Comments
Show 8 older comments Hide 8 older comments

Walter Roberson on 10 Apr 2021

Open in MATLAB Online

Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));

(This was a bit more difficult to develop than might appear from the code.)

Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.

Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)

The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .

Another way to write the code without that hack would be something like

Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);

Paul MAUGER on 10 Apr 2021

Edited: Paul MAUGER on 10 Apr 2021

Open in MATLAB Online

It's very elegant indeed. In fact, I need the mean of the 6 GHI values of hour 01:00 to 01:59 to account for hour 02:00 (our point of view is the average energy accumulated in 1 hour). I just added a " +1/24 " in this line to do the trick :

sample_datetime = dateshift(datetime(sample{:,1} +1/24 ,'ConvertFrom','datenum'),'start','minute','nearest');

In the function you generously proposed, I modified " sum " for " mean " and at first glance it seems the results are OK (at least for the 2 first complete hours I checked). The code has been modified this way :

Mnan = @(v)mean(v) + 0/(numel(v)==6);
datamean = rmmissing(retime(data, 'hourly', Mnan));

Now, to be honest I have trouble understanding the purpose of 0/(numel(v)==6) despite your kind explanations and that I understand what does numel.

What I understand is :

1st line : Mnan is a function and v its variable

2nd line : v would be any value of GHI from " data ", grouped 'hourly', on which the function applies

Is it that 0/numel(v), when v ≠ 6, gives a " 0/0 ", hence a NaN ?

So it automatically exlcudes any hour from " datamean " when there wasn't 6 values of GHI ?

Thanks a lot for your time. Please let me know how I can validate your answer as it's my first time asking something here.

Oh I almost forgot, I tried to extract the datetime vector from the TimeTable with : datamean.Time ; any hint of how I shoud do (I feel more confortable working with vectors with my little experience) ?

Paul MAUGER on 10 Apr 2021

Edited: Paul MAUGER on 10 Apr 2021

Open in MATLAB Online

EDIT : That's OK, sorry I needed time and trials to understand. I used " datamean.sample_datetime " and I got the vector I need.

Grateful thanks for your help

In fact I need the filtered hourly timestamps, which I have to extract from " datamean " as in original " data ", the timestamps are not " hourly " and also contains unwanted (because uncomplete) hours.

I have about 6 experimental datasets to compare with 2 hourly GHI simulations datasets ; for the easier datasets (2 columns vectors : datetime / GHI) I already compared, I used this trick :

I = ismember(dataset_experimental_datetime,dataset_simulation_datetime); % index of corresponding datetime between experimental and simulation datasets
dataset_experimental_datetime = dataset_experimental_datetime(I==1,:); % filtering experrimental datetime vector
dataset_experimental_GHI = dataset_experimental_GHI(I==1,:); clear I % filtering eperimental GHI vector according to datetime

I then get my to two " N rows x 2columns " experimental and simulation datasets which are the same size and every experimental GHI at a given hour can be compared to the simuation GHI at the same given hour.

I would like to use the same method, thus I need to access the timestamps vector of the filtered timetable.

I'm not used to TimeTable, I tried it for first time, by despair, yesterday before you helped me, without knowing if it would be OK.

Again, thanks a lot for time and explanations.

Paul MAUGER on 12 Apr 2021

Thanks for the advice. I indeed thought I would gain some speed not keeping tables and array I don't need anymore.

Walter Roberson on 13 Apr 2021

Open in MATLAB Online

Consider for example

bar = [1 2 3];
bar(2)
clear bar
bar(2)
if rand() < 0.5; bar = [4 5 6]; end
bar(2)

The first bar(2), bar is a variable and indexing is to be used.

The second bar(2), bar has been cleared. But look -- bar() is the name of a function, so the second bar(2) invokes the graphics function!

The third bar(2)... bar might have been assigned a value, or it might not have been, so the third bar(2) might be indexing or it might be a function call...

If there had been no clear of bar, then it would not be necessary to re-resolve the name and no need to have to insert the hooks to deal with the possibility that you are dealing with function instead of variable.

In the case where a variable is never going to be referred to again in the same function, then it is not clear whether there is a performance penalty for clearing the variable as opposed to assigning empty to it, but Mathworks has specifically guided to avoid clearing variables inside a loop.

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Walter Roberson on 10 Apr 2021

Open in MATLAB Online

0 votes

[Copying answer down from comments]

Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));

(This was a bit more difficult to develop than might appear from the code.)

Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.

Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)

The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .

Another way to write the code without that hack would be something like

Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Filter or delete data from (time)table depending of missing values

10 Comments
Show 8 older comments Hide 8 older comments

Accepted Answer

0 Comments
Show -2 older comments Hide -2 older comments

More Answers (0)

Categories

Products

Release

Tags

Community Treasure Hunt

Filter or delete data from (time)table depending of missing values

10 Comments Show 8 older comments Hide 8 older comments

Accepted Answer

0 Comments Show -2 older comments Hide -2 older comments

More Answers (0)

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

10 Comments
Show 8 older comments Hide 8 older comments

0 Comments
Show -2 older comments Hide -2 older comments