Bloom Filters by Example

https://news.ycombinator.com/rss Hits: 8
Summary

简体中文 Bloom Filters by Example A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set. The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to demonstrate: Each empty cell in that table represents a bit, and the number below it its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1. It's easier to see what that means than explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions: When you add a string, you can see that the bits at the index given by the hashes are set to 1. I've used the color green to show the newly added ones, but any colored cell is simply a 1. To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If they aren't, you know that the element isn't in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let's demonstrate: Test an element for membership: fnv: murmur: Is the element in the set? no Probability of a false positive: 0% And that's the basics of a bloom filter! Advanced Topics Before I write a bit more about Bloom filters, a disclaimer: I've never used them in production. Don't take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more. In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted. Hash Functions The hash functions used in a Bloom filter should be independent and uniformly distributed. They should als...

First seen: 2025-06-29 13:37

Last seen: 2025-06-29 20:38