TECHNOLOGY · BITE · 2 MIN · INTERMEDIATE

The Bloom Filter Lies to You on Purpose

A Bloom filter can guarantee 'not present' — but 'yes' is always a guess. The false positives are structural.

Burton Howard Bloom invented the Bloom filter in 1970 while working at Computer Usage Company, and published it in the Communications of the ACM under the unassuming title "Space/Time Trade-offs in Hash Coding with Allowable Errors." The 'allowable errors' in the title are not a flaw — they are the entire point.

A Bloom filter works like this: you have a bit array of size m, initialized to zeros. When you add an element, you run it through k independent hash functions, and set the bit at each of the k positions they return. To query whether an element is present, you hash it the same k ways and check those bit positions. If any bit is 0, the element was never added — certain. If all bits are 1, the element was probably added — but maybe not, because a different combination of elements could have set the same bits.

The false positive rate is a function of m (the array size), k (the number of hash functions), and n (the number of elements inserted). You can tune the tradeoff: a larger array gives fewer false positives. But you can never achieve zero false positives without losing the space advantage that makes Bloom filters worthwhile in the first place.

Google used Bloom filters in Bigtable to avoid disk lookups for keys that don't exist. Akamai used them to avoid caching URLs that were only requested once — a one-hit-wonder filter. Cassandra uses them to skip SSTables that definitely don't contain a row. In every case, the value is the same: a fast, compact structure that eliminates expensive work when the answer is 'no', at the cost of occasionally being wrong about 'yes'.

#data-structures#algorithms#databases#probabilistic

Sources

Communications of the ACM, 1970 Wikipedia

The Bloom Filter Lies to You on Purpose

Make Recess yours.