
What's a transformer? (the AI kind)

Before 2017, neural networks read text word by word. The transformer let them look at every word at once.

The transformer is the neural network architecture published in June 2017 in a paper called "Attention Is All You Need," by eight researchers, most of them at Google Brain and Google Research. Every model you've heard of in the current AI wave (GPT-4, Claude, Gemini, Llama, the open-source crowd) is a transformer or a close variant. The idea is simpler than the hype suggests, and worth understanding once.

Start with the problem the transformer was solving. Before 2017, the dominant architecture for language was the recurrent neural network (RNN), often in its LSTM form. RNNs process a sentence one word at a time, carrying a hidden "state" forward — read "the," update state, read "cat," update state, read "sat," update state. This worked, but it was slow (you can't parallelize a sequence you have to walk through in order) and it forgot. By the time the network reached the end of a long paragraph, the influence of the first sentence had been compressed and overwritten many times.
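
To make the bottleneck concrete, here is a minimal sketch of a plain recurrent update in pure Python. The weights and the three "word vectors" are hand-picked and hypothetical (a real RNN or LSTM learns its weights and uses gating), and the function names are illustrative only.

```python
import math

def rnn_step(state, x, w_state, w_input):
    """One recurrent update: new_state = tanh(W_state @ state + W_input @ x)."""
    new_state = []
    for i in range(len(state)):
        total = sum(w_state[i][j] * state[j] for j in range(len(state)))
        total += sum(w_input[i][k] * x[k] for k in range(len(x)))
        new_state.append(math.tanh(total))
    return new_state

def run_rnn(inputs, w_state, w_input):
    """Walk the sequence in order; step t cannot start until step t-1 has finished."""
    state = [0.0] * len(w_state)
    for x in inputs:
        state = rnn_step(state, x, w_state, w_input)
    return state

# Toy, hand-picked weights (hypothetical; training would normally set these).
W_STATE = [[0.5, 0.0], [0.0, 0.5]]
W_INPUT = [[1.0, 0.0], [0.0, 1.0]]

# Three made-up 2-dimensional vectors standing in for "the", "cat", "sat".
sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
final_state = run_rnn(sentence, W_STATE, W_INPUT)
print(final_state)
```

The loop in run_rnn is the part that resists parallelization, and the single fixed-size state is where early words get squeezed and overwritten.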

The transformer threw out recurrence entirely. Instead, it processes every token in the input simultaneously, and uses a mechanism called attention to let each token pull information from every other token directly. Here's the core trick: for every token, the model produces three vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("here's the actual content"). To figure out what token X should pay attention to, the model takes X's query, dot-products it with the key of every token in the sequence (X's own included; in GPT-style models, only X and the tokens before it), scales the scores down by the square root of the key dimension, runs them through a softmax to turn them into weights that sum to 1, and uses those weights to compute a weighted sum of all the values. The token now has a representation that's a blend of the tokens it attended to, weighted by relevance.
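
The query, key, and value steps above can be sketched in a few lines of pure Python. This is a single-head illustration without masking, not a production implementation: the division by the square root of the key dimension follows the 2017 paper, the queries, keys, and values are passed in directly (in a real model they come from learned projections of the token embeddings), and the toy numbers are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention for one sequence (single head, no mask)."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs

# Made-up 2-dimensional queries, keys, and values for a three-token sequence.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed = attention(queries, keys, values)
print(mixed)
```

Nothing in the loop over queries depends on a previous iteration, which is why real implementations compute all tokens at once as a few matrix multiplications.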

A worked example: in the sentence "The cat sat on the mat because it was tired," the word "it" needs to figure out who "it" refers to. With attention, when the model processes "it," its query strongly matches the key for "cat" (and weakly matches the key for "mat"). The resulting representation of "it" is mostly cat. With an RNN, this kind of long-range link had to survive being passed through every intermediate word's state update. With attention, it's a single direct lookup.
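
As a rough numeric illustration of that example, the snippet below turns a set of made-up query-key scores for "it" into attention weights. The scores are invented for illustration only (a trained model would produce its own), with "cat" given the highest score and "mat" a smaller one.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["The", "cat", "sat", "on", "the", "mat", "because", "it", "was", "tired"]

# Hypothetical scores of the query for "it" against each word's key.
scores = [0.0, 4.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]

weights = softmax(scores)
for word, weight in zip(words, weights):
    print(f"{word:>8}  {weight:.2f}")
```

With these invented scores, most of the weight lands on "cat," a smaller share on "mat," and the rest is spread thinly over the other words.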

The "tokens" themselves aren't quite words. Before any of this, text gets chopped into pieces by a tokenizer — usually with a method called Byte-Pair Encoding (BPE), which builds a vocabulary of around 50,000 to 200,000 sub-word units. Common words like "the" become one token; rarer words like "transformer" might split into "trans" + "former." Each token then gets mapped to a learned vector (an embedding) of typically 1,024 to 12,288 dimensions, plus a position encoding so the model knows token order — because without recurrence, the math is otherwise position-blind.
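
The lookup-plus-position step can be sketched as follows. Several things here are assumptions for illustration: the five-entry vocabulary, the 8-dimensional width, and the randomly initialised embedding table are toys, and the sinusoidal encoding is the scheme from the 2017 paper (many later models use learned or rotary position schemes instead).

```python
import math
import random

def sinusoidal_position(pos, dim):
    """Sinusoidal encoding from the 2017 paper: sin on even indices, cos on odd."""
    vec = []
    for i in range(dim):
        # Indices 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "trans": 3, "former": 4}
DIM = 8  # toy width; real models use far wider vectors

rng = random.Random(0)
# One randomly initialised row per token; training would tune these values.
EMBEDDINGS = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in VOCAB]

def embed(tokens):
    """Look up each token's vector and add the encoding for its position."""
    rows = []
    for pos, tok in enumerate(tokens):
        emb = EMBEDDINGS[VOCAB[tok]]
        pe = sinusoidal_position(pos, DIM)
        rows.append([e + p for e, p in zip(emb, pe)])
    return rows

inputs = embed(["the", "trans", "former", "sat"])
print(len(inputs), len(inputs[0]))
```

Because the position vector is added in, the same token at two different positions ends up with two different input vectors, which is what lets attention tell word order apart.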

A full transformer stacks dozens of these attention layers (GPT-3 had 96; OpenAI has not published the count for GPT-4), each interleaved with a feed-forward network. The model is trained on the simple objective of predicting the next token given everything before it. Do this with enough text (trillions of tokens, scraped from the open web, books, and code) and enough parameters, and the network grudgingly learns grammar, facts, reasoning patterns, and whatever else is implicit in the training data. The 2020 paper introducing GPT-3 was the moment it became clear that scaling this one architecture, with no fundamental algorithmic change, kept producing qualitatively new capabilities.
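
The next-token objective is usually written as a cross-entropy loss: at each position, the model outputs one score (a logit) per vocabulary entry, and it is penalised by how little probability it gave to the token that actually came next. The sketch below assumes hand-written logits and a three-token vocabulary; in a real system the logits would come from the stacked layers, and the function name is illustrative.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy for predicting token t+1 from the output at position t."""
    total = 0.0
    count = 0
    for t in range(len(token_ids) - 1):
        logits = logits_per_position[t]
        target = token_ids[t + 1]
        # log(sum(exp(logits))), computed stably by factoring out the max.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the true next token under softmax(logits).
        total += log_norm - logits[target]
        count += 1
    return total / count

token_ids = [0, 1, 2]  # a three-token sequence over a hypothetical 3-word vocabulary

# A model with no preference at all: every next token equally likely.
clueless = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# A model that strongly favours the correct next token at each position.
confident = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0], [0.0, 0.0, 0.0]]

print(next_token_loss(clueless, token_ids))
print(next_token_loss(confident, token_ids))
```

Training nudges the parameters to push this number down across the whole corpus; the logits at the final position are unused here because the sequence has no following token to compare against.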

That's the whole shape of it. Tokens in, embeddings, stacked attention layers where each token gets to look at the other tokens and decide what's relevant, predict next token, repeat. The original 2017 paper has been cited over 150,000 times as of 2025, making it one of the most cited computer science papers of the decade. Almost everything labeled "AI" in a product release since 2020, from chatbots to image generators (many of which now use transformer-based diffusion models) to protein-structure predictors like AlphaFold, traces back in some form to that one diagram of a stack of attention heads.
