How to generate random numbers correlated to a given dataset in MATLAB
I have a matrix x with 10,000 rows and 20 columns. I want to generate a new matrix of random numbers, y, where y is correlated to x with correlation coefficient q.
Note that the matrix x is not normally distributed: it follows a power-law distribution.
Answers (2)
John D'Errico
on 28 Jul 2015
Edited: John D'Errico
on 29 Jul 2015
7 votes
Ah. Every once in a while, I see a question come up that is interesting. In this case, it should not be difficult to do. In fact, I can see at least one solution, and maybe a second way to do so. (I'll post an answer later today. Must run out now. Sorry, but at least you can know to expect an answer if nobody else gives you one.)
A question first though. Since the mean of a variable has no impact on the correlation, do you care what the mean of y will be? Or can it simply have mean 0?
Next, I assume you mean the traditional correlation coefficient, thus the Pearson version?
Hmm, as I think about it, a more interesting question is: given a set of n variables, with their own set of inter-correlations, can we choose a new variable that has a given set of n specified correlations with each of those n variables? And I think the answer is yes, of course we can do so, as long as we have sufficient degrees of freedom.
Before I go for now though, here is a fun paper on the subject.
Later... (m-file solution attached to this answer.)
The basic idea is for a variable x, find a new vector y0, such that y0 is orthogonal to x. Then choose some linear combination of x and y0 that has the desired correlation.
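The idea above can be sketched in a few lines. This is a minimal illustration, not the attached m-file: the variable names, the example data, and the use of a single Gram-Schmidt step are assumptions made here. It uses `corrcoef` from base MATLAB so no toolbox is needed.

```matlab
% Sketch: build y with an exact sample correlation rho to a given vector x.
rng(0);                          % fixed seed so the run is repeatable
x   = rand(100, 1).^(-1/2);      % heavy-tailed example data (an assumption)
rho = 0.5;                       % target Pearson correlation

xc = x - mean(x);                % center x; the mean does not affect correlation
xc = xc / norm(xc);              % scale to unit length

y0 = randn(size(x));             % any random starting vector
y0 = y0 - mean(y0);              % center it
y0 = y0 - (xc' * y0) * xc;       % remove the component along x (Gram-Schmidt step)
y0 = y0 / norm(y0);              % unit length, now uncorrelated with x

y = rho * xc + sqrt(1 - rho^2) * y0;   % mix so that the correlation equals rho

c = corrcoef(x, y);
fprintf('achieved correlation: %.6f\n', c(1, 2));
```

The result has mean 0 and unit length; adding a constant or multiplying by a positive scale factor afterwards leaves the correlation unchanged.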
7 Comments
Greig
on 28 Jul 2015
It is an interesting paper you link to. It's always interesting and useful to think a little more deeply about concepts we take for granted.
Here is another PDF version that doesn't have part of the title and some equations cut off.
John D'Errico
on 29 Jul 2015
Edited: Steven Lord
on 5 Feb 2017
I decided to write the solution as a function. It has a lot of internal comments. The basic idea that I chose for the solution was to find a second vector y0, that has ZERO correlation with x. Then find some linear combination of x and y0 that has exactly the desired correlation. I've attached my m-file solution to this comment.
x = rand(100,1);
tic,y = randwithcorr(x,.5);toc
Elapsed time is 0.007392 seconds.
corr(x,y)
ans =
0.5
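Why the result is exact rather than approximate can be seen from a short derivation. The notation is introduced here and is not taken from the m-file: $\tilde{x}$ is x after centering and scaling to unit length, and $\tilde{y}_0$ is a centered, unit-length vector with $\tilde{x}^{\top}\tilde{y}_0 = 0$.

```latex
\begin{aligned}
y &= \rho\,\tilde{x} + \sqrt{1-\rho^{2}}\;\tilde{y}_0, \\
\tilde{x}^{\top} y &= \rho\,\tilde{x}^{\top}\tilde{x} + \sqrt{1-\rho^{2}}\;\tilde{x}^{\top}\tilde{y}_0 = \rho, \\
\lVert y \rVert^{2} &= \rho^{2} + (1-\rho^{2}) = 1, \\
\operatorname{corr}(x, y) &= \frac{\tilde{x}^{\top} y}{\lVert \tilde{x} \rVert\,\lVert y \rVert} = \rho .
\end{aligned}
```

Since both vectors are centered, the last line is the Pearson coefficient, and it holds for any distribution of x.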
Note that randwithcorr has ABSOLUTELY NO requirements about the distribution of x. x may be a vector or an array of any shape. Ok, two requirements, but they are small and very logical ones.
1. x must have at least 3 elements. Otherwise, it makes no sense to talk about a correlation with some other vector.
2. x must not be a constant vector. Again, it makes no sense to talk about correlation then.
I am quite confident that I could do some optimization in this code, but it is pretty fast as it is, and I am feeling lazy right now. Too hot today to actually think. For a vector of length 1e6, it still takes only 0.4 seconds to run.
x = rand(1000000,1);
tic,y = randwithcorr(x,-.75);toc
Elapsed time is 0.412036 seconds.
corr(x,y)
ans =
-0.75
I've attached the solution m-file to my answer above, as well as to this comment.
[SL: Edited formatting of numbered list so you don't have to scroll to see the contents of each item.]
John D'Errico
on 29 Jul 2015
Edited: John D'Errico
on 29 Jul 2015
Oh, I just saw your comment that y should also follow the same distribution as x. This would make the problem very difficult if x has some completely arbitrary distribution. For example, suppose you had not told me at all what the distribution of x was? Almost as bad, even for simple distributions, it is often quite difficult to generate correlated random variables for anything other than normal distributions, where you specify things like correlations and covariances. Really, those parameters make the most sense in the context of a Gaussian random variate. I've honestly never really seen any good treatment for generating correlated variates for something like Weibull, exponential, or gamma random variables.
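One heuristic that keeps the distribution, offered here as an assumption-laden sketch rather than as part of the answer: reuse the values of x themselves, reordered to follow the ranks of an auxiliary vector that has the requested correlation. The empirical distribution of y then matches x exactly, but the achieved Pearson correlation is only approximate and can fall well short of the request for heavy-tailed data.

```matlab
% Sketch: y reuses the values of x (same empirical distribution),
% reordered to follow the ranks of an auxiliary correlated vector.
rng(1);
x   = rand(10000, 1).^(-1/3);    % example power-law-like data (an assumption)
rho = 0.6;                       % correlation requested for the auxiliary vector

% Auxiliary vector z with exact correlation rho to x
xc = x - mean(x);           xc = xc / norm(xc);
z0 = randn(size(x));        z0 = z0 - mean(z0);
z0 = z0 - (xc' * z0) * xc;  z0 = z0 / norm(z0);
z  = rho * xc + sqrt(1 - rho^2) * z0;

% Give y the sorted values of x, placed in the rank order of z
[~, order] = sort(z);
y = zeros(size(x));
y(order) = sort(x);

c = corrcoef(x, y);
fprintf('requested %.2f, achieved %.4f\n', rho, c(1, 2));
```

If the achieved value is too far off, the requested rho for the auxiliary vector can be adjusted by trial and error until the reordered y lands close to the target.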
RuiyangGe
on 5 Feb 2017
Dear John,
Could this function be extended to generate multi-variables with a fixed correlation coefficient between any pairs of these variables?
Thanks, Ruiyang
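A possible sketch for this case, not based on the attached function: orthonormalize a centered random matrix and multiply by the Cholesky factor of the target correlation matrix. The sizes and the value of rho are example choices, and `Z - mean(Z, 1)` assumes MATLAB R2016b or newer for implicit expansion.

```matlab
% Sketch: k random columns whose pairwise sample correlations all equal rho.
rng(2);
n = 1000;  k = 4;  rho = 0.3;    % example sizes and target (assumptions)

R = rho * ones(k) + (1 - rho) * eye(k);   % target correlation matrix
% R is positive definite only for -1/(k-1) < rho < 1

Z = randn(n, k);
Z = Z - mean(Z, 1);              % center each column
[Q, ~] = qr(Z, 0);               % economy QR: orthonormal, centered columns
Y = Q * chol(R);                 % Y' * Y equals R up to rounding

C = corrcoef(Y);
disp(C);
```

Because the columns of Q are centered and orthonormal, the sample correlation matrix of Y equals R up to rounding, not just in expectation.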
Aditya Nanda
on 17 Jan 2019
Thanks for the great answer, John.
I am working on the exact problem you mention in your answer.
"Hmm, as I think about it, a more interesting question is: given a set of n variables, with their own set of inter-correlations, can we choose a new variable that has a given set of n specified correlations with each of those n variables?"
Can this be done? Please help! The correlation is the cosine of the angle between two vectors (of length n). Can we construct a new vector that has specified correlations to existing vectors? All the vectors $x_1, x_2$, etc. are n-dimensional vectors and n>m.
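A sketch of one way this can be done, under the assumption that there are m existing vectors stored as the columns of X with n > m + 1: solve for the part of y that lies in the span of the standardized columns, then add a random orthogonal part to bring y to unit length. The target values in r are hypothetical, and `mean(X, 1)` with implicit expansion assumes R2016b or newer.

```matlab
% Sketch: y with specified correlations r(i) to each column of X.
rng(3);
n = 500;  m = 3;
X = rand(n, m);                      % existing variables (example data)
r = [0.4; -0.2; 0.3];                % target correlations (hypothetical values)

Z = X - mean(X, 1);                  % center each column
Z = Z ./ sqrt(sum(Z.^2, 1));         % unit-length columns
R = Z' * Z;                          % correlation matrix of X

b = R \ r;                           % weights for the part of y inside span(Z)
s = 1 - r' * b;                      % squared length left for the random part
assert(s >= 0, 'targets are not jointly achievable for this X');

e = randn(n, 1);  e = e - mean(e);
e = e - Z * (R \ (Z' * e));          % project out the columns of Z
e = e / norm(e);

y = Z * b + sqrt(s) * e;             % unit length, centered

C = corrcoef([X, y]);
disp(C(1:m, end)');
```

The check on `s` is the feasibility condition: the targets must satisfy r' * inv(R) * r <= 1, otherwise no vector can have those correlations with all columns at once.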
Dian Fan
on 21 Apr 2020
John, does the file only work for rho = 1 or -1? It is strange that a non-unity rho gives errors.
Josué Ortega
on 14 Sep 2017
Edited: Josué Ortega
on 14 Sep 2017
This is a great answer @John D'Errico, but I have a further question. Suppose I have a vector of 10 numbers, the integers from 1 to 10 in random order (for example 2 1 3 5 4 ... 10), which could be generated by
x=randperm(10).'
Now I want to generate another vector of 10 numbers, containing all numbers from 1 to 10 again, that has a correlation of at least p with my previous vector x. Any ideas? Your code works but of course produces real numbers between -1 and 1.
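One simple heuristic for the permutation case, given as a sketch and not as a known best method: start from y = x, which has correlation 1, and apply random swaps, keeping a swap only when the correlation stays at or above p. The value of p and the number of attempts are arbitrary example choices, and the result is not a uniform draw from all permutations that meet the bound.

```matlab
% Sketch: a permutation y of 1..10 with corr(x, y) >= p, found by random swaps.
rng(4);
x = randperm(10).';
p = 0.6;                         % required minimum correlation (example value)

y = x;                           % start from correlation exactly 1
for t = 1:200                    % number of attempted swaps (arbitrary)
    idx = randperm(10, 2);       % pick two distinct positions
    trial = y;
    trial(idx) = trial(fliplr(idx));   % swap them
    c = corrcoef(x, trial);
    if c(1, 2) >= p              % keep the swap only if still at or above p
        y = trial;
    end
end

c = corrcoef(x, y);
fprintf('corr = %.4f\n', c(1, 2));
```

Every accepted state satisfies the bound, so the loop can be stopped at any point and y is still a valid answer.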