Goal : Choose a sample of $k$ items from a list $S$ containing $n$ items, where $n$ is either unknown or very large. For instance, a stream of numbers coming in.
Lets start with the procedure and then prove why this correct.
- Keep the first k items in an array (we call this a reservoir)
- When the $i^{th}$ item arrives, $i > k$
- With the probability $k/i$, keep the new item, discarding an old one selected uniformly at random from the reservoir.
- With probability $1-k/i$, reject this new item.
Now, lets take an example and work out this math, Lets take k = 10
- For the first $10$ items, add them to the reservoir with probability $1$
- When the $11^{th}$ item comes, it is
- added to the reservoir with probability $10/11$
- Now, what is the probability of an item $I$ which was already present in the reservoir when the $11^{th}$ item came ?
$p_{required} = $ (the 11th element is chosen and it replaces an element other than $I$ ) or (the 11th element is rejected)
\begin{align}
&= 10/11 * 9/10 +(1-10/11) \\
&= 9/11+1/11 \\
&= 10/11
\end{align}
So, all the elements in the reservoir stay with the probability of $10/11$ Lets work out what happens when the $12^{th}$ element comes.
- It is added to reservoir with probability $10/12$
- What is the probability that element $I$ (which was present in the reservoir)" will stay ?
$= 10/11 * ( 10/12 * 9/10 + (1-10/12) )$
-
$= 10/11 ( 9/12 + 2/12) = 10/11 * 11/12 = 10/12 $
How do we code this up ?
Input: Elements S[1..n], we want to sample k elements where $k << n$
- Allocate reservoir $R$ of size $k$
- for $i$ = $1$ to $k$
$R[i] = S[i]$ - For $i = k + 1$ to n
$j$ = $random(1,i)$ // inclusive
if $j <= k$
$R[j] = S[i]$
Lets see how this code is equivalent to what we computed earlier, in an inductive way.
- Lets assume after round $i - 1$, each element in the reservoir stays with the probability $k/(i-1)$
- For round i, we have
- New element stays with probability $k/i$
- Old element stays with probability :
$= \frac{k}{i - 1} (\frac{k}{i} * \frac{k - 1}{k} + (1 - \frac{k}{i}))$
$= \frac{k}{i-1}\frac{i-1}{i} $
$= \frac{k}{i} $ as required
Reservoir sampling with random sort
- Assign a random number as a key to each item
- Sort these items and maintain k items with minimum key values
Distributed Reservoir Sampling
- What if we have huge amounts of data? It is desirable to distributed sampling tasks among many machines in parallel
- A Simple approach: Distributed sort with a random key like above and pick top k elements .
Instead, we
- Distribute data among m machines
- Each machine $c$ does its own sampling using a random key and produces a sample of size <= $k$ items
- Collect all samples $n^{'} <= mk$
- Sample $k$ items from these $n^{'}$ items using the same random key
If we want to associate a weight $W_i$ with each item, then the random key can be scaled as $r^{1/w_i}$. That is, higher the weight bigger the key.
No comments:
Post a Comment