randsample issues and generating random numbers from large populations
Show older comments
The randsample function is pretty handy -- I supply a population in a vector and a number of samples I'd like drawn uniformly without replacement. Usually my population is a simple, increasing set of numbers, like 1 to 100000.
But I'm working now on a much larger problem where I'd like to generate, say, 500,000 samples from the population 1:2^60, or something huge. The problem is that the first thing randsample does is create a vector x = zeros(1,2^60), which of course breaks immediately as the maximum variable size is violated.
My question is this -- if my needs are simple enough, is it possible to generate the samples without storing the entire population in a vector? Could I generate them one at a time? Or could I somehow get the vector of samples as before but without storing the population?
A cursory idea was to generate smaller random numbers and concatenate them, but then I'm not sure what I'm losing on the theoretical side...
Any advice, statistical or MATLAB would be appreciated.
Accepted Answer
More Answers (2)
Derek O'Connor
on 1 Feb 2012
Here are two functions for taking a sample S of size Ns from a large, simply-defined population P of size Np. Both use O(Ns) space.
The first function samples with replacement and is very simple, with time complexity O(Ns).
The second function samples without replacement and this requirement makes things a lot more difficult. The only way I know of doing this is to use the Ulam-von Neumann acceptance-rejection principle: keep generating randomly until you get what you want. This is done by the while-loop that that tests the current random element r from P for membership in S, and rejects if r is in S. This rejection loop means that this method has a random running time, which, in principle, may be infinite.
In the rejection while-loop, we need to test for membership of S, as each random element r of P is generated. This can be done in two obvious ways:
- Use a "bit-vector" member of size Np and set "bit" member(r) when r is first chosen. Thus the membership test is O(1), but this solution is ruled out because Np is too big.
- Scan S(1:k-1) for r, when r is chosen from P. This uses no extra memorybut requires O(Ns) time for the membership test.
Thus we have the classic trade-off between time and space.
A Bigger Problem
Walter points out that there are (increasingly) problems with random number generators: we have lots of memory which allows us to generate huge samples or permutations (= a sample of size Np, from a population of size Np, without replacement.) However, current (or future?) random number generators cannot possibly access the complete sample space of all permutations of size 10^8, yet we can generate these large permutations in a few seconds or minutes. So, what part of the gigantic (10^8)! permutation space are these generators sampling from?
Related Topics:
- http://www.mathworks.co.uk/matlabcentral/answers/27515-random-sample-without-replacement
- http://www.mathworks.co.uk/matlabcentral/answers/879-does-matlab-have-a-birthday-problem
- http://www.mathworks.co.uk/matlabcentral/answers/12248-what-does-randperm-miss
% ---------------------------------------------------------------
function [S,pdup] = RandSampWR(L,U,Ns);
% ---------------------------------------------------------------
% Generate a random sample S of size Ns from the range of Np
% integers L:U, with replacement. Time and Space Complexity:O(Ns)
% Prob[Duplicates in S] ~ 1 - exp(-Ns^2/(2*Np))
% Derek O'Connor, 12 Mar 2011. derekroconnor@eircom.net
%
% USE: L=-2^50; U=2^50; Ns=2^25; [S,pdup] = RandSampWR(L,U,Ns);
% ---------------------------------------------------------------
S = zeros(1,Ns);
Np = U-L+1;
for k = 1:Ns
S(k) = L + floor(rand*Np);
end
pdup = 1-exp(-Ns^2/(2*Np));
% -----------------------------------------------------------------
function [S,Enw,pdup] = RandSampNR(L,U,Ns);
% -----------------------------------------------------------------
% Generate a random sample of size Ns from the range of Np integers
% L:U, without replacement, using a rejection loop.
% Time Complexity of member is O(Ns) and this is called nw times,
% which is a random variable that ranges from Ns to Inf.
% Thus Time Complexity is at least O(Ns^2) and is suitable for small
% Ns << U-L+1 only. Space Complexity is O(Ns).
% Note that this is a Las Vegas algorithm whose running time is
% random, because of the rejection while-loop.
% Derek O'Connor, 12 Mar 2011. derekroconnor@eircom.net
%
% USE: [S,Enw,pdup] = RandSampNR(-2^29,2^29,2^10)
% -------------------------------------------------------------
S = zeros(1,Ns);
Np = U-L+1;
S(1) = L + floor(rand*Np);
nw = 0;
for k = 2:Ns
r = L + floor(rand*Np);
while member(r,S,k-1)
r = L + floor(rand*Np); nw = nw+1;
end
S(k) = r;
end
Enw = nw/Ns;
pdup = 1-exp(-Ns^2/(2*Np));
% ------------------------------
function answer = member(e,v,k)
% ------------------------------
answer = false;
for i = 1:k
if v(i) == e
answer = true;
return
end
end
Peter Perkins
on 1 Feb 2012
Michael, if you are using the most recent version of MATLAB, R2011b, try this:
>> tic, randperm(2^53-1,10), toc
ans =
Columns 1 through 9
1.4197e+15 8.7423e+15 8.6214e+15 4.3719e+15 7.2083e+15 1.278e+15 3.7989e+15 8.2482e+15 7.1356e+15
Column 10
8.6423e+15
Elapsed time is 0.000610 seconds.
>> tic, randperm(2^53-1,5e5); toc
Elapsed time is 0.061926 seconds.
1 Comment
Walter Roberson
on 2 Feb 2012
That gets to 2^53, but Michael was hoping for 2^60.
Also, one of Derek's Questions a few months ago showed that with populations this large, you would not get a true permutation: you would have too many duplicate values coming out of the internal rand() [which has only 2^53-2 possibilities], and each duplicate value is sorted by order of appearance whereas a permutation would allow for either order.
Categories
Find more on Uniform Distribution (Continuous) in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!