Reservoir Sampling

Summary

Reservoir sampling is a technique for selecting a fair random sample when you don't know the size of the set you're sampling from. By the end of this essay you will know:

- When you would need reservoir sampling.
- The mathematics behind how it works, using only basic operations: subtraction, multiplication, and division. No math notation, I promise.
- A simple way to implement reservoir sampling if you want to use it.

Before you scroll! This post has been sponsored by the wonderful folks at ittybit, and their API for working with videos, images, and audio. If you need to store, encode, or get intelligence from the media files in your app, check them out!

# Sampling when you know the size

In front of you are 10 playing cards and I ask you to pick 3 at random. How do you do it?

The first technique that might come to mind from your childhood is to mix them all up in the middle. Then you can straighten them out and pick the first 3. You can see this happen below by clicking "Shuffle."

Every time you click "Shuffle," the chart below tracks what the first 3 cards were. At first you'll notice some cards are selected more than others, but if you keep going it will even out. All cards have an equal chance of being selected. This makes it "fair."

Click "Shuffle 100 times" until the chart evens out. You can reset the chart if you'd like to start over.

This method works fine with 10 cards, but what if you had 1 million cards? Mixing those up won't be easy. Instead, we could use a random number generator to pick 3 indices. These would be our 3 chosen cards.

We no longer have to move all of the cards, and if we click the "Select" button enough times we'll see that this method is just as fair as the mix-up method.

I'm stretching the analogy a little here. It would take a long time to count through the deck to get to, say, index 436,234. But when it's an array in memory, computers have no trouble finding an element by its index.
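The two known-size methods described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the original demos: the function names `pick_by_shuffle`, `pick_by_index`, and `tally` are made up for this example, and it assumes the 3 random indices must be distinct, which `random.sample` guarantees.

```python
import random
from collections import Counter


def pick_by_shuffle(cards, k, rng):
    """Mix up a copy of the deck, then take the first k cards."""
    mixed = list(cards)
    rng.shuffle(mixed)
    return mixed[:k]


def pick_by_index(cards, k, rng):
    """Draw k distinct random indices and look the cards up directly.

    No cards are moved, so this stays cheap even for a very large deck.
    """
    indices = rng.sample(range(len(cards)), k)
    return [cards[i] for i in indices]


def tally(pick, cards, k, rounds, seed=0):
    """Count how often each card is chosen across many rounds,
    much like the chart that tracks the selected cards."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(rounds):
        counts.update(pick(cards, k, rng))
    return counts


cards = list(range(1, 11))  # 10 cards, labelled 1 to 10
shuffle_counts = tally(pick_by_shuffle, cards, 3, 10_000)
index_counts = tally(pick_by_index, cards, 3, 10_000)
```

With 10 cards and 3 picks per round, each card should turn up in roughly 30% of rounds under either method, which is what "fair" means here. Both functions need `len(cards)` or the whole deck up front, so neither works when the size of the set is unknown.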
Now let me ...

First seen: 2025-05-08 18:11

Last seen: 2025-05-09 01:12