BLOOM FILTER · BITE · 2 MIN · BEGINNER

The Data Structure That Lies on Purpose

A Bloom filter can tell you a username is taken using a few bits per entry. It might also be wrong, in one specific direction.

Burton Howard Bloom published the trick in a 1970 paper while working at a computer company in Boston. His problem was hyphenation: a typesetting program needed to check, for every word, whether it was in a dictionary of exceptions. Storing the dictionary in memory was expensive. Going to disk for every lookup was slow.

Bloom's answer was a fixed-size bit array and a handful of independent hash functions. To add a word, hash it through each function and set the corresponding bits to 1. To check membership, hash again and inspect the same bits. If any of them is 0, the word is definitely not in the set. If all are 1, the word is probably in the set — with a tunable false positive rate.

The asymmetry is the whole point. Bloom filters never give a false negative. They sometimes claim a word is in the set when it isn't, and you accept that in exchange for using a tiny fraction of the memory a real dictionary would need. With about 10 bits per element you get a false-positive rate near 1%.

Modern systems lean on this hard. Google Bigtable and Apache Cassandra use Bloom filters to skip disk reads for keys that aren't there. Bitcoin clients used them to request only relevant transactions. Web browsers screen URLs against malware lists. Spell checkers still run them.

A structure that admits its own uncertainty turned out to be a quietly profound design choice.

#bloom-filter#data-structures#computer-science#hashing#algorithms

Sources

Wikipedia Communications of the ACM

The Data Structure That Lies on Purpose

Make Recess yours.