Filling data gaps in a time series using multiple linear regression
Hello,
I have multiple rain stations in a catchment. I chose 3 of them that are close to each other. I have 10 years of data from each station, but all 3 time series have data gaps.
I chose 3 years of data within the time series without gaps (apart from minor 1- or 2-day gaps) to find the correlation, as below.

Then I did multiple linear regression as shown below.

Questions.
1) Is this the right procedure to fill data gaps in a time series using multiple linear regression (MLR)?
2) stats has 6 values. Kindly help me understand what each value means. (If it were 1x4, then I would know it is R-squared, F-statistic, p-value, significance.)
3) When I used the equation and picked a random point to check, the actual value y (which I know from the time series) is 0.4 but the predicted value is 5.6.
4) If I have a really strong correlation coefficient, why is the predicted value not closer?
Any help would be appreciated.
Thanking you in advance
16 Comments
Star Strider
on 13 Nov 2021
I am not exactly certain what the code is doing; however, it would likely be worth reading the data using readtimetable and using the retime function to interpolate the missing values.
Using timetable arrays is best for this sort of work. All the necessary functions already exist, and they are relatively easy to learn to use effectively.
Muhammad Haris Siddiqui
on 13 Nov 2021
Dave B
on 13 Nov 2021
wrt question 2: this is not stats, it's what the documentation calls bint (the lower and upper confidence intervals for what the documentation calls b and that you have named R). For instance, the 95% CI estimate of the offset (.0685) is -.0432 to .1803.
To get stats, you want the 5th output. If you don't care about the others you can use ~:
[R,~,~,~,stats]=regress(...)
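For reference, here is a minimal sketch of all five outputs of regress (the data are made up; only the output structure matters):

```matlab
% Hypothetical predictors x1, x2 and response y
rng(0)
x1 = 10*rand(100,1);
x2 = 10*rand(100,1);
y  = 2 + 0.5*x1 + 0.3*x2 + randn(100,1);

X = [ones(size(y)) x1 x2];              % column of ones for the intercept
[b,bint,r,rint,stats] = regress(y,X);

% b     -> coefficient estimates (intercept first, given the column order of X)
% bint  -> 95% confidence intervals for b, one row per coefficient
%          (a 3x2 matrix here, i.e. the 6 values you saw)
% r     -> residuals
% rint  -> intervals useful for diagnosing outliers
% stats -> 1x4: [R-squared, F-statistic, p-value, error-variance estimate]
```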
Star Strider
on 13 Nov 2021
My pleasure!
‘I don't want to interpolate. I want to do Multiple Linear regression to fill the gaps in time series.’
That’s interpolation, even if you don’t want to call it that!
That technique uses linear interpolation, although much less efficiently than the existing functions available to use with timetable arrays.
If the intent is to do a multiple linear regression on the existing data, do the regression on the data interpolated using retime, or simply do the regression on the data with missing values. The regression algorithms have no idea what the data are, and don't care if there are any missing values or anything else, so long as all the values are finite and real (in this instance) and the dimensions match.
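A minimal sketch of the retime route (the file name and column names are hypothetical):

```matlab
% Read the gauge records into a timetable, regularise to a daily grid,
% and interpolate the missing values.
TT      = readtimetable('stations.csv');   % hypothetical: Date, S1, S2, S3 columns
TTdaily = retime(TT,'daily','linear');     % linear interpolation across gaps
% For rainfall, a zero fill may be more defensible than interpolation:
% TTdaily = retime(TT,'daily','fillwithconstant','Constant',0);
```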
If you want to try to interpolate a missing station reading for a given instance at times for which the alternate stations are available, then perhaps the MLR option would make some sense. This would not be interpolation over time, however; your model has no time component.
For precipitation data, that's probably at least as good as, and perhaps better than, using interpolation over time for the given station, since precipitation events occur stochastically, not with any functional form.
However, any technique such as this is highly fraught with danger, particularly when applied globally without serious model checking including visualization.
I would venture that your replacement of NaN values with zeros, and doing the regression with those, is less reliable than if you were to remove those observations entirely from the data set.
In order to then use this model, you would have to apply it individually to the observations where there is data for the predictor variables and the response station is missing; not globally. This way you would also need a model for each missing station, using some other set of predictors for each; it would not be reasonable to assume the same model would be at all meaningful for all missing locations.
It's an interesting concept; it would be far easier to make realistic comments if you were to attach the dataset for folks to poke at.
As @Dave B says, you've misinterpreted the return values from regress() above; read the documentation much more carefully before proceeding.
Dave B
on 13 Nov 2021
I was also suspicious of the replacement of NaN values with 0s. If there are cases where you have NaN for more than one station, then this will artificially raise the correlation coefficients and simultaneously reduce the (true) predictability of the model.
Another thought: whenever I'm dealing with regression problems, I like to use a plot for a reality check. Consider: can you make a plot to see what's going on here? This can be a little tricky with multiple dimensions, but you might be able to visualize this with scatter3 for the raw data and either scatter3 or a surface for the predicted results. It will likely be easier to do this if you choose a subset of the data for this approach.
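As a sketch of that visualization (assuming two predictor stations x1, x2, a response y, and coefficients b from regress with the intercept column first; all names hypothetical):

```matlab
scatter3(x1,x2,y,10,'filled')                     % raw observations
hold on
[X1g,X2g] = meshgrid(linspace(min(x1),max(x1),20), ...
                     linspace(min(x2),max(x2),20));
Yg = b(1) + b(2)*X1g + b(3)*X2g;                  % fitted regression plane
surf(X1g,X2g,Yg,'FaceAlpha',0.4,'EdgeColor','none')
xlabel('Station 1'), ylabel('Station 2'), zlabel('Response station')
```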
dpb
on 13 Nov 2021
I think it will be highly dependent upon the type of rain events prevalent in the given area--if rainfall tends to be widespread, the idea would seem to have merit. If, OTOH, most rainfall is local thunderstorms as it is here where I am located, unless the predictor stations are very close indeed, the chances are they won't be particularly good surrogates.
We can get a nice rain here at the house, but often by the time you get to the east edge of town, only 2.5 miles west, it may have only sprinkled, if it has done anything at all. Of course, it can also be entirely the reverse. We sat at the dining room table (in a room about 14' across) of the small house I grew up in at noon one day when I was a kid and watched the rain run off the eaves out the west window for almost 30 minutes before the east half of the house got wet. "It has to end somewhere."
This kind of thing is why serious model verification and data exploration would be imperative.
Muhammad Haris Siddiqui
on 13 Nov 2021
Muhammad Haris Siddiqui
on 13 Nov 2021
A challenge here might be a preponderance of zero data. Suppose I took two weather stations that are far apart but in the same general region. We might expect that they have rain in the same season; on days that it rains at one, it's more likely that it rains at the other, but the amount of rain on those days might be totally uncorrelated.
This would produce a very high correlation coefficient, because the days where it doesn't rain have identical values (0), and on the days where it does rain, it's more likely to rain in both. Here's a very, very reduced example:
x = zeros(10000,1);
y = zeros(10000,1);
n = 500;
r = randsample(10000,n);                  % the days on which it rains
% when it rains, it typically rains in both, but an uncorrelated amount
x(r) = (rand(n,1)>.05).*(3*randn(n,1)+20);
y(r) = (rand(n,1)>.05).*(3*randn(n,1)+20);
scatter(x,y,'.')
rho = corr(x,y);
[b,~,~,~,stats] = regress(y,[x ones(size(x))]);
xi = xlim;
yi = polyval(b,xi);
hold on
plot(xi,yi)
[stats(1) rho^2]                          % just a reality check
dpb
on 14 Nov 2021
I would think one would want to build the model including only data observations where
- there are observations for all the proposed predictor locations and the response location,
- there is measurable accumulation at at least one of the predictor locations or the response location.
I would investigate very thoroughly the preponderance of instances of the second case above -- there being observed accumulation at the response station with no accumulation at any of the proposed predictors. That, of course, is certainly not the only diagnostic/model-verification step that should be taken, but it's one very obvious one.
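Those screening rules might be applied along these lines (s1, s2 as predictor stations and y as the response, all hypothetical column vectors):

```matlab
haveAll = ~isnan(s1) & ~isnan(s2) & ~isnan(y);    % all stations observed
anyRain = (s1 > 0) | (s2 > 0) | (y > 0);          % measurable accumulation somewhere
keep    = haveAll & anyRain;
[b,~,~,~,stats] = regress(y(keep), [ones(nnz(keep),1) s1(keep) s2(keep)]);

% The diagnostic mentioned above: rain at the response station only
suspect = haveAll & (y > 0) & (s1 == 0) & (s2 == 0);
fprintf('%d days with rain only at the response station\n', nnz(suspect))
```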
Muhammad Haris Siddiqui
on 14 Nov 2021
Dave B
on 14 Nov 2021
@Muhammad Haris Siddiqui glad it's helpful.
I'm not sure if you saw this above, but I really do think that plotting is a good place to start. This is really where MATLAB shines, because you take a problem and have a sort of interactive data interrogation instead of just trying something and seeing whether or not it works.
I often start with the simplest case I can and go from there: suppose you start with just using one station to predict another; it's really easy to plot (raw data, fit line, predictions). You can see how well it works and where it's failing, and a plot will tell you why it's doing what it's doing.
Then expand to multiple stations; you'll need to adjust your plot: you might consider looking at the prediction from each station independently, as above, and then the combined model using some combination of tools like scatter3 or surface.
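A sketch of that single-station starting point (x as one predictor station, y as the response; names hypothetical):

```matlab
p = polyfit(x,y,1);                           % one-predictor linear fit
scatter(x,y,'.')                              % raw data
hold on
xf = linspace(min(x),max(x),100);
plot(xf,polyval(p,xf),'LineWidth',1.5)        % fitted line
plot(x,polyval(p,x),'o')                      % predictions at the observed points
xlabel('Predictor station'), ylabel('Response station')
```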
dpb
on 14 Nov 2021
Again, the effectiveness of this model will be highly dependent upon the type of rainfall event your particular catchment sees. If it is indeed true that "when it rains, it rains" over a large area, then it will likely be a fairly representative surrogate. If it's AZ or SW KS, I'd venture "not so much".
I wholeheartedly agree with @Dave B that visualization should be a key component of exploratory model-building and verification; simply relying on blind correlation is not science.
Muhammad Haris Siddiqui
on 15 Nov 2021
Answers (0)