Random sampling without replacement

A quick runtime analysis of methods for taking random samples without replacement

There are a few ways of picking random samples without replacement from a body of elements. One can imagine naively selecting an element at random from the list of elements and removing it so that it cannot be selected during the next draw. It is also straightforward to imagine simply keeping track of the elements that have already been selected and randomly sampling the whole population each time until an element is found that was not previously sampled. It turns out that both of these methods have their place when it comes to random sampling, and they should be used in different contexts. This writeup provides a basic analysis of each method (and perhaps when they should be used).
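As an illustration of the first idea, here is a minimal Python sketch of the removal-based approach (the function name sample_by_removal and the swap-and-pop trick are my own choices for this sketch, not taken from any particular library):

```python
import random

def sample_by_removal(population, m):
    """Draw m distinct elements by removing each selection from a working copy."""
    pool = list(population)  # copy so the caller's sequence is untouched
    samples = []
    for _ in range(m):
        idx = random.randrange(len(pool))
        # Swap the chosen element to the end and pop it off in O(1),
        # so it cannot be selected again on a later draw.
        pool[idx], pool[-1] = pool[-1], pool[idx]
        samples.append(pool.pop())
    return samples
```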

Iterative Reselection

When generating a sample without replacement, we can simply keep track of the elements that have been selected and iteratively reselect elements from the entire population until finding an element that has not previously been chosen. In cases where it is possible to generate elements of the population on the fly (e.g. permutations or combinations of a set), this approach can be space efficient because it does not require enumerating all members of the (possibly massive) population. The main question becomes: how many reselections should we expect to make before arriving at the desired number of unique random samples?
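A minimal Python sketch of this reselection strategy, under the assumption that the caller supplies a zero-argument callable that produces one uniformly random element of the population (the draw_random_element interface here is hypothetical, used only for illustration):

```python
import random

def sample_by_reselection(draw_random_element, m):
    """Collect m distinct samples by redrawing whenever a duplicate appears.

    draw_random_element: zero-argument callable returning one uniformly random
    element of the population (assumed interface for this sketch).
    """
    seen = set()
    while len(seen) < m:
        candidate = draw_random_element()
        # Duplicates are silently discarded; the loop simply tries again.
        seen.add(candidate)
    return list(seen)

# Example: 5 distinct values from {0, ..., 99} without materializing the population.
print(sample_by_reselection(lambda: random.randrange(100), 5))
```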

Let $N$ be the size of the population at hand (e.g. the number of possible permutations of a set) and $m$ be the desired number of distinct random samples. Once $j$ distinct elements have already been sampled, the probability that the next selection yields a new element is $\frac{N-j}{N}$. This matches the intuition that it becomes increasingly less likely to draw a new element after having sampled more and more elements. As a result, we might anticipate that the expected number of reselections will increase drastically with greater values of $m$. Now let $X_j$ be a random variable taking on the number of selections required to obtain the next distinct element after $j$ distinct elements have already been collected. Note that each $X_j$ is a geometric random variable:

$$P(X_j = k) = \bigg(1-\frac{N-j}{N}\bigg)^{k-1}\bigg(\frac{N-j}{N}\bigg) = \bigg(\frac{j}{N}\bigg)^{k-1}\bigg(\frac{N-j}{N}\bigg)$$

with expected value $\mathbb{E}[X_j] = \frac{N}{N-j}$. The random variable $X = X_0 + \cdots + X_{m-1}$ is then the total number of selections required to arrive at the final sample set of size $m$. Its expected value tells us the average number of selections we will have to make before getting a result, which is the value we set out to find. By linearity of expectation:

$$\begin{aligned}\mathbb{E}[X] &= \mathbb{E}[X_0 + \cdots + X_{m-1}] \\ &= \mathbb{E}[X_0] + \cdots + \mathbb{E}[X_{m-1}] = \sum_{j=0}^{m-1}{\frac{N}{N-j}}\end{aligned}$$
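As a sanity check on this formula, here is a small Python sketch (the specific values of $N$ and $m$ are illustrative choices, not from the analysis above) that compares the closed-form sum against a Monte Carlo estimate of the number of selections:

```python
import random

def expected_selections(N, m):
    """Exact expected number of selections: sum_{j=0}^{m-1} N / (N - j)."""
    return sum(N / (N - j) for j in range(m))

def simulate_selections(N, m, trials=1_000):
    """Average number of draws (with reselection) needed to collect m distinct values."""
    total = 0
    for _ in range(trials):
        seen = set()
        while len(seen) < m:
            seen.add(random.randrange(N))
            total += 1  # count every draw, including duplicates
    return total / trials

N, m = 1_000, 900
print(expected_selections(N, m))  # ~2298 for these illustrative values
print(simulate_selections(N, m))  # should land close to the exact value
```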

This sum can be approximated with the integral

$$\int_0^m{\frac{N}{N-j}}\,dj = N(\log{N} - \log{(N-m)}) = N\log{\bigg(\frac{N}{N-m}\bigg)}$$

which is $O(N\log{N})$ in the worst case, as $m$ approaches $N$; when $m$ is much smaller than $N$, the expected number of selections stays close to $m$. More to come on this writeup soon.
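To get a feel for how tight the integral approximation is, here is a quick comparison in Python (the $N$ and $m$ values are arbitrary illustrative choices):

```python
import math

def exact_sum(N, m):
    """Exact expected number of selections."""
    return sum(N / (N - j) for j in range(m))

def integral_approx(N, m):
    """Integral approximation: N * log(N / (N - m))."""
    return N * math.log(N / (N - m))

for N, m in [(1_000, 100), (1_000, 500), (1_000, 990)]:
    print(N, m, round(exact_sum(N, m), 1), round(integral_approx(N, m), 1))
```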