TECHNOLOGY · BITE · 2 MIN · INTERMEDIATE

How Shazam Identifies a Song From Four Seconds of Noise

Avery Wang's 2003 paper turned every song into a constellation of dots, and that is what your phone is matching against in a noisy bar.

Avery Wang published the algorithm behind Shazam in a 2003 ISMIR paper, so the trick is older than the iPhone. Instead of comparing waveforms, his system reduces every track to a sparse map of peaks in a spectrogram: the points that carry more energy than their neighbors in time and frequency, plotted as dots on a frequency-time grid. A pop song becomes a constellation.
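To make the constellation idea concrete, here is a minimal sketch of peak picking on a toy spectrogram. It assumes the spectrogram is already computed and stored as a grid of magnitudes indexed `spec[time][frequency]`; the function name `find_peaks`, the `radius` and `floor` parameters, and the toy numbers are all illustrative, not taken from the paper or from Shazam's code.

```python
def find_peaks(spec, radius=1, floor=0.0):
    """Return (time_index, freq_index) cells that are strictly larger than
    every neighbor within `radius` cells and that exceed `floor`."""
    n_t = len(spec)
    n_f = len(spec[0]) if spec else 0
    peaks = []
    for t in range(n_t):
        for f in range(n_f):
            value = spec[t][f]
            if value <= floor:
                continue  # too quiet to count as a peak
            is_peak = True
            for dt in range(-radius, radius + 1):
                for df in range(-radius, radius + 1):
                    if dt == 0 and df == 0:
                        continue
                    tt, ff = t + dt, f + df
                    inside = 0 <= tt < n_t and 0 <= ff < n_f
                    if inside and spec[tt][ff] >= value:
                        is_peak = False
            if is_peak:
                peaks.append((t, f))
    return peaks


# Toy 4x4 grid: rows are time steps, columns are frequency bins.
toy = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.1, 0.2],
]
print(find_peaks(toy))  # [(1, 1), (2, 3)]
```

Sixteen cells collapse to two dots. A larger `radius` would give a sparser constellation, which is the knob a real system would tune.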

A fingerprint is then built from pairs of those peaks. Each pair encodes two frequencies and the time gap between them, packed into a single 32-bit integer and stored alongside the time at which the first peak of the pair occurs in the track. One song produces hundreds of thousands of these little hashes. They are robust because the peaks tend to survive distortion: the music in your café is compressed by the speakers, smeared by the room, and buried under chatter, but the loudest harmonics generally make it through.
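The pairing step can be sketched in a few lines. The bit widths (10 bits per frequency, 6 bits for the gap), the `fan_out` limit, and the names `pack` and `pair_hashes` are assumptions chosen for illustration; the production layout may differ.

```python
def pack(f1, f2, dt):
    """Pack two frequency bins and a time gap into one integer.
    Assumed layout: 10 bits for f1, 10 bits for f2, 6 bits for dt."""
    return (f1 & 0x3FF) << 16 | (f2 & 0x3FF) << 6 | (dt & 0x3F)


def pair_hashes(peaks, fan_out=3, max_dt=63):
    """Pair each anchor peak with up to `fan_out` later peaks.
    Returns a list of (hash, anchor_time) tuples."""
    peaks = sorted(peaks)  # order by time, then frequency
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt <= 0:
                continue  # same time step, not a usable pair
            if dt > max_dt:
                break     # peaks are sorted, so later ones are further away
            out.append((pack(f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return out


# Peaks are (time_index, freq_index) tuples.
song = [(0, 40), (2, 55), (5, 30)]
clip = [(100, 40), (102, 55), (105, 30)]  # same peaks, heard 100 steps later

print([h for h, _ in pair_hashes(song)] == [h for h, _ in pair_hashes(clip)])  # True
```

Because each hash depends only on the two frequencies and the gap between them, the same passage yields the same hashes wherever the recording happens to start. Only the anchor times differ.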

When your phone listens, it builds the same kind of hashes from the four-or-so seconds of audio it captured and ships them to Shazam's servers. The server looks each hash up in an index of every track in the catalog. Most matches are noise. The trick is to look not just for matching hashes, but for matching hashes that line up at a consistent time offset — a real song produces a tight diagonal line on a scatter plot of database time vs. query time. A coincidence does not.
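The consistent-offset test amounts to a vote count, sketched below. The index is a plain dictionary from hash to `(track_id, time)` entries, and the hash values, track names, and function names are made up for the example; a real catalog would need a far more compact structure.

```python
from collections import Counter, defaultdict


def build_index(tracks):
    """Map each hash to every (track_id, time_in_track) where it occurs."""
    index = defaultdict(list)
    for track_id, hashes in tracks.items():
        for h, t in hashes:
            index[h].append((track_id, t))
    return index


def best_match(index, query):
    """Vote for (track, offset) pairs and return the winner.
    offset = time in the stored track minus time in the query clip."""
    votes = Counter()
    for h, q_time in query:
        for track_id, db_time in index.get(h, ()):
            votes[(track_id, db_time - q_time)] += 1
    if not votes:
        return None
    (track_id, offset), score = votes.most_common(1)[0]
    return track_id, offset, score


# Hypothetical catalog: each entry is (hash, time_in_track).
tracks = {
    "song_a": [(11, 0), (22, 3), (33, 7), (44, 9), (22, 12)],
    "song_b": [(22, 1), (55, 4), (33, 20)],
}
index = build_index(tracks)

# A clip that starts 3 steps into song_a, plus one hash of pure noise (99).
query = [(22, 0), (33, 4), (44, 6), (99, 8)]
print(best_match(index, query))  # ('song_a', 3, 3)
```

Hash 22 also appears in song_b and later in song_a, and hash 33 appears in song_b, but those stray hits land at offsets 1, 12, and 16 with one vote each. Only the true alignment collects three votes at the same offset, which is the diagonal line on the scatter plot.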

Apple bought Shazam in 2018 for a reported $400 million, mostly to bake the recognizer into iOS. The math underneath has barely changed since the paper.


Sources

ISMIR 2003
Apple Newsroom

