Why do I get correlation result NaN?

I have two matrices, A and B, each of size 1*1058, and both contain some NaN values. Is there any way to get the correlation between these two matrices?

 Accepted Answer

Unless you have a good reason to impute your missing data, you can remove missing values from both vectors.
nanidx = isnan(A) | isnan(B);
corr(A(~nanidx), B(~nanidx))
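As a hedged alternative to masking by hand, corr and corrcoef both accept a 'Rows','complete' option that drops any observation with a NaN in either input. The sketch below uses made-up sample data (the sizes, seed, and number of NaNs are assumptions, not the asker's data) and reshapes the inputs to columns with (:), because corr treats columns as variables and rows as observations.

```matlab
% Hypothetical sample data shaped like the 1-by-1058 inputs in the question
rng(0);                                  % fixed seed so the sketch is repeatable
A = randn(1, 1058);
B = 0.5*A + randn(1, 1058);
A(randperm(1058, 20)) = NaN;             % inject some missing values
B(randperm(1058, 20)) = NaN;

% Manual removal of every position that is NaN in either vector
nanidx = isnan(A) | isnan(B);
R1 = corr(A(~nanidx).', B(~nanidx).');   % pass columns: rows are observations

% Let corr discard incomplete observations itself
R2 = corr(A(:), B(:), 'Rows', 'complete');

% Base-MATLAB route (no Statistics and Machine Learning Toolbox needed)
C  = corrcoef(A(:), B(:), 'Rows', 'complete');
R3 = C(1, 2);

fprintf('manual: %.4f  corr: %.4f  corrcoef: %.4f\n', R1, R2, R3);
```

All three values should agree to within round-off, since each one uses only the positions where both A and B are present.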

9 Comments

Most likely, the correlation is between two sequences, so simply removing NaNs may not address the problem.
However, Ive J's code removes every position that is NaN in either vector, so the pairing between the two sequences is retained.
Remember that there is no implicit "time" information in correlation calculation. If you were to shuffle the two matrices the same way, you would get the same correlation.
format long g
A = randn(50,1);
B = randn(50,1);
C1 = corr(A,B);
order = randperm(50);
AA = A(order);
BB = B(order);
C2 = corr(AA,BB);
C1
C1 =
-0.273729352523849
C2
C2 =
-0.273729352523849
C1-C2
ans =
0
It follows from this that it does not matter whether a certain value is in A(37) or has been moved down to A(34) because three NaNs were removed, as long as B(37) is moved to the same relative location.
First, the following code will fail for matrix input:
nanidx = isnan(A) | isnan(B);
corr(A(~nanidx), B(~nanidx))
A(~nanidx) will be a vector instead of a matrix.
Second, we hope that the OP does mean corr rather than xcorr. Simply removing NaNs without filling the missing values will cause problems for xcorr, though it might not be an issue with corr.
True, corr does proceed column by column, so removing NaNs in this way would be a problem if matrices were being used. But the user is using vectors. And the two-input form xcorr(x,y) accepts only vectors, not matrices.
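For the matrix case raised in this exchange, one possible approach is corr's 'Rows','pairwise' option, which, for each pair of columns, uses only the rows where both columns are present. This is a minimal sketch on hypothetical data (the matrix sizes and NaN positions are assumptions), with one entry checked against manual masking.

```matlab
% Hypothetical matrices: columns are variables, rows are observations
rng(1);
X = randn(200, 3);
Y = randn(200, 2);
X(randperm(200, 10), 1) = NaN;           % NaNs fall in different rows per column
X(randperm(200, 10), 3) = NaN;
Y(randperm(200, 10), 2) = NaN;

% Each rho(i,j) uses only rows where X(:,i) and Y(:,j) are both non-NaN
rho = corr(X, Y, 'Rows', 'pairwise');    % 3-by-2 result

% Check one entry against manual masking of that column pair
keep  = ~isnan(X(:,3)) & ~isnan(Y(:,2));
rho32 = corr(X(keep,3), Y(keep,2));

fprintf('pairwise: %.4f  manual: %.4f\n', rho(3,2), rho32);
```

Logical indexing on the whole matrix, as in A(~nanidx), flattens it to a vector, which is why the masking has to be done per column pair here.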
I checked the corr value after removing the NaNs, and it works. But now the MAPE value is not what I expected, possibly because the NaNs were removed. And if I leave the missing data in, both the corr and MAPE values are NaN.
If you leave in any NaN then corr() will always be NaN.
corr() has to conceptually take mean() of x and y, and mean() of x is sum(x)/numel(x), but when x includes nan then sum(x) is going to be nan. So mean(x) would have to be nan in that case, and then x - mean(x) would be nan everywhere, and nan * some function of y would be nan, and sum() of nan would be nan... so corr() would have to produce nan in such a case.
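The propagation described above can be traced with the textbook Pearson formula on a tiny made-up example. This is only an illustration of the arithmetic; the internal implementation of corr() may be organised differently.

```matlab
% Small hypothetical vectors; x has one missing value
x = [1 2 NaN 4 5].';
y = [2 1 4 3 6].';

% Textbook Pearson formula with the NaN left in
mx = mean(x);                            % NaN, because sum(x) is NaN
my = mean(y);
r  = sum((x - mx).*(y - my)) / sqrt(sum((x - mx).^2) * sum((y - my).^2));

% Same formula after dropping positions where either vector is NaN
keep = ~isnan(x) & ~isnan(y);
xc = x(keep);
yc = y(keep);
rClean = sum((xc - mean(xc)).*(yc - mean(yc))) / ...
         sqrt(sum((xc - mean(xc)).^2) * sum((yc - mean(yc)).^2));

fprintf('with NaN: %g   after removal: %.4f\n', r, rClean);
```

A single NaN in x makes mean(x) NaN, which makes every deviation x - mean(x) NaN, so the final ratio is NaN as well.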
I agree with your explanation, sir. Thank you to all the experts for your time. I'm thinking of the procedure below to make it more reasonable. Please let me know your suggestions.
  1. Find the locations of the missing data in each of the two arrays.
  2. Take the union of the two sets of locations.
  3. Remove the entries at the locations in the union from both arrays.
  4. Find the correlation.
Yes that's the whole idea!
% step 0-create two sample vectors with 5 missing values
sz = 100;
A = rand(sz, 1);
B = rand(sz, 1);
A(randperm(sz, 5)) = nan;
B(randperm(sz, 5)) = nan;
% step 1-find missing values in both vectors
nanidx = isnan(A) | isnan(B);
% step 2- remove the indices in step 1
cleanA = A(~nanidx);
cleanB = B(~nanidx);
% step 3- calculate the correlation coeff.
R = corr(cleanA, cleanB); % NOTE: by default Pearson correlation is used in corr function
% step 4- report it
fprintf('Pearson R is %.2f\n', R)
Pearson R is -0.08
That's what Ive J's code does: removes all locations for which X is nan or Y is nan.


More Answers (1)

Chunru on 13 Aug 2021
Use "fillmissing" to fill in the NaNs before computing the correlation. See doc fillmissing for more details.

7 Comments

I have tried the function with the syntax A=fillmissing(A,'movmean',5); it worked for some cases, giving a corr value above 95%, but for other cases corr is NaN. I don't think it is reasonable to change the window value (5 here) every time to suit our requirements.
fillmissing() cannot fill every NaN in all cases, for example when the NaNs sit at the beginning or end and the chosen method has no valid neighbouring values to work from.
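A small sketch of how this can show up with 'movmean': as far as I understand the documented behaviour, an entry stays NaN when every value inside its window is NaN. The vector below is made up for illustration, and the 'linear' call with 'EndValues','nearest' is shown only as one option that does fill the ends.

```matlab
% Hypothetical vector with a run of three NaNs at the start
A = [NaN NaN NaN 4 5 6 NaN 8];

% Window of 3: positions 1 and 2 see only NaNs, so they are expected to stay NaN
F = fillmissing(A, 'movmean', 3);
stillMissing = isnan(F);

% Interpolation with explicit end handling fills every entry
G = fillmissing(A, 'linear', 'EndValues', 'nearest');

disp(F)
disp(G)
fprintf('NaNs left by movmean: %d\n', nnz(stillMissing));
```

Any NaN left behind in F would make corr() return NaN again, which matches the behaviour described in the comment above.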
Agree with you. Thank you
fillmissing has a lot of options:
A = [NaN NaN 5 3 NaN 5 7 NaN 9 NaN;
8 9 NaN 1 4 5 NaN 5 NaN 5;
NaN 4 9 8 7 2 4 1 1 NaN]
A = 3×10
   NaN   NaN     5     3   NaN     5     7   NaN     9   NaN
     8     9   NaN     1     4     5   NaN     5   NaN     5
   NaN     4     9     8     7     2     4     1     1   NaN
F = fillmissing(A,'linear',2,'EndValues','nearest')
F = 3×10
     5     5     5     3     4     5     7     8     9     9
     8     9     5     1     4     5     5     5     5     5
     4     4     9     8     7     2     4     1     1     1
Thank you for the code. I have some doubts about this case. Does it make sense to fill the NaNs with values that differ so much? Is the correlation obtained this way justifiable?
Anyway, if you have data with so many NaNs, you need to doubt your data first before doubting the processing techniques. There is no foolproof technique for filling missing data. It all depends on what data you have and what you want.
Mathematically, if you have vectors A and B, then
cAB = corr(A,B);
P = randperm(numel(A));
pA = A(P);
pB = B(P);
cpAB = corr(pA, pB);
then cAB needs to equal cpAB to within round-off. The order of the elements within each vector does not matter: only the correspondence between the two vectors matters.
If, though, you were to fillmissing(A) and compare that to fillmissing(pA) then you would get different results, because fillmissing works based upon nearby values, under the assumption there is some kind of smooth continuity. This is not really compatible with the mathematics of correlation which does not care about order within the sequence.
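The order dependence can be seen in a small sketch. The vectors and the permutation below are made up for illustration, and 'linear' is just one of the neighbour-based methods fillmissing offers.

```matlab
% Hypothetical data: A has two missing values, B is complete
A = (1:10).';
A([3 7]) = NaN;
B = [2 1 4 3 6 5 8 7 10 9].';
P = [10 1 9 2 8 3 7 4 6 5];              % fixed permutation applied to both

% Fill first, then compare with filling after the shuffle
fA  = fillmissing(A, 'linear');          % neighbours in the original order
fpA = fillmissing(A(P), 'linear');       % neighbours in the shuffled order

sameFill = isequal(fpA, fA(P));          % false: the filled values depend on order

c1 = corr(fA,  B);
c2 = corr(fpA, B(P));
fprintf('fill then corr: %.4f   shuffle, fill, corr: %.4f\n', c1, c2);
```

Without the fill step, shuffling both vectors with the same P leaves corr unchanged; with it, the filled values and therefore the correlation change.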
If you have some prediction function for your vectors, then yes, it might make sense to apply that prediction function. It might even make sense to apply something like narx to predict in some cases. But that would have to be done based upon knowledge of what the vectors represent. fillmissing() has no knowledge of what they represent.

Release

R2019b
