Filter or delete data from (time)table depending of missing values
Show older comments
Hello dear community (from which I already read a lot - thank you).
I've got datasets (solar irradiation - GHI) for approx. 4 years with a 10mn timestep ; and it's full of holes. I need to get a hourly average of GHI and I want to exclude/delete any "hour" that hasn't 6 GHI values (for 0, 10, 20, 30, 40 and 50 minutes within eah hour, of each day, of each month, of each year).
I joined a sample of one dataset. I already know how to correct some flaws of the file, like to round some datetime value that are for exemple (HH:MM:SS) 10:09:59 instead of 10:10:10 (with dateshift 'start','minute','nearest').
I already tried "retime" by creating a timetable, filled with NaN values for missing GHI but this is not convenient as I also need access to the datetime vector (once filtered) for comparisons with other datasets. Is there anyway to avoid a loop (datasets sizes vary from 200k to 300k rows) ? If not, a hint would be appreciated.
Thank you very much
10 Comments
Walter Roberson
on 10 Apr 2021
but this is not convenient as I also need access to the datetime vector
Wouldn't you just access the time vector from the timetable ? It should tell you what the times of the entries are.
Paul MAUGER
on 10 Apr 2021
Walter Roberson
on 10 Apr 2021
Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));
(This was a bit more difficult to develop than might appear from the code.)
Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.
Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)
The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .
Another way to write the code without that hack would be something like
Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);
Paul MAUGER
on 10 Apr 2021
Edited: Paul MAUGER
on 10 Apr 2021
Walter Roberson
on 10 Apr 2021
cIs it that 0/numel(v), when v ≠ 6, gives a " 0/0 ", hence a NaN ?
When v has 6 elements then numel(v) == 6 is true, true has numeric value 1, and 0/1 is 0, so in that case you add 0, giving you the mean untouched.
When v doesnot have 6 elements then numel(v)==6 is false, false has numeric value 0, and 0/0 is nan, so in that case you add nan, which poisons any finite value getting nan for the total no matter what the numeric value was.
So now you have values that are either numeric valid values or nan for invalid totals (missing elements). And then you apply rmmissing to the resulting timetable, which throws away the entries with nan.
This also means that you throw away entries that we not missing but had nan explicitly for the value, so if you happen to have data that marks missing with nan you do not need to filter it first.
Walter Roberson
on 10 Apr 2021
data = timetable(sample_datetime,sample_GHI);
If that is what you used to build the timetable then datamean.sampletime to get the times.
Paul MAUGER
on 10 Apr 2021
Edited: Paul MAUGER
on 10 Apr 2021
Walter Roberson
on 10 Apr 2021
Note: in most cases, you should avoid clearing variables unless it is necessary to test their existence for the flow of control you have established: clearing variables is documented as being lower performance than leaving them. If you have a large variable and you need to release the state, then Mathworks suggests assigning empty of the correct datatype to the variable instead of clearing the variable.
Paul MAUGER
on 12 Apr 2021
Walter Roberson
on 13 Apr 2021
Consider for example
bar = [1 2 3];
bar(2)
clear bar
bar(2)
if rand() < 0.5; bar = [4 5 6]; end
bar(2)
The first bar(2), bar is a variable and indexing is to be used.
The second bar(2), bar has been cleared. But look -- bar() is the name of a function, so the second bar(2) invokes the graphics function!
The third bar(2)... bar might have been assigned a value, or it might not have been, so the third bar(2) might be indexing or it might be a function call...
If there had been no clear of bar, then it would not be necessary to re-resolve the name and no need to have to insert the hooks to deal with the possibility that you are dealing with function instead of variable.
In the case where a variable is never going to be referred to again in the same function, then it is not clear whether there is a performance penalty for clearing the variable as opposed to assigning empty to it, but Mathworks has specifically guided to avoid clearing variables inside a loop.
Accepted Answer
More Answers (0)
Categories
Find more on Interactive Control and Callbacks in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!