An interactive primer

How LLMs Actually Work

Trace one message through the whole machine — then take every piece apart with your own hands. No math required; bring only curiosity.

18 min read · Interactive · Updated Jun 2026

Reader's contract. You are smart and curious but you are not an ML engineer, and you don't want to become one. You want to understand — well enough to look a founder or a researcher in the eye and know whether their claim holds water. This document leads with pictures and analogies, defines every piece of jargon the moment it appears, and never makes you read a wall of text when a diagram would do. Math stays in the basement; we'll only come upstairs for it when a number actually changes how you think.

On-ramp  Trace one message through the whole stack

Watch the whole machine run once

Before we take anything apart, let's watch the whole machine run once, end to end, on a single real message. Everything in the rest of this document is just a zoom-in on one of these steps. Keep this picture in your head; we'll hang every later idea off it.

You type into a chat box:

"How many r's are in strawberry?"

You hit enter. To you it feels instant and obvious. To the model, your sentence is about to go through five transformations before a single word comes back. Here is the journey.

One message, six stages.

Everything later in this article is a zoom-in on exactly one of these boxes. The model never sees letters — only the chunks in stage 1.

STAGE 1 · Tokenize: Split the sentence into chunks.
STAGE 2 · Embed: Each chunk becomes a vector.
STAGE 3 · Attention: Tokens read each other.
STAGE 4 · Predict: Rank every possible next token.
STAGE 5 · Sample: Pick one (temperature decides how boldly).
STAGE 6 · Output, loop: Append it, run again, one token at a time.

Illustrative pipeline. Each stage gets its own section below.

Stage 1 — Tokenize. The model can't read letters. The first thing that happens is your sentence gets chopped into tokens — chunks of text, usually a word or a fragment of a word, that the model was taught to recognize as atomic units. "How", " many", " strawberry" might each be a token; longer or rarer words get split into pieces. Crucially, the model now sees chunks, not letters — remember this; it's the whole reason the strawberry question is hard for it.

Stage 2 — Embed. Each token is turned into a long list of numbers — a vector, which is just a coordinate that places the token somewhere in a vast "meaning-space." Tokens with similar meanings land near each other. The word is gone; a position in space has replaced it.

Stage 3 — Attention / Transformer. Now the model lets every token look at every other token and decide which ones matter for understanding it. This is attention, and it's the engine. "r's" looks back at "strawberry" and "how many" to figure out what's being counted. This happens in stacked layers, each one refining the picture.

Stage 4 — Predict. After all that looking-around, the model produces one thing and one thing only: a giant ranked list of every possible next token, each with a probability. It is, at heart, the world's most sophisticated autocomplete. For our prompt, the top candidates might be "There", "The", "Straw…", each with a score.

Stage 5 — Sample. From that ranked list, the model samples — picks one token. How adventurously it picks is controlled by a dial called temperature (we'll play with it later). Pick the safe top choice, or roll the dice on a lower-ranked one.

Stage 6 — Output, then loop. The chosen token is shown to you, then appended to the input, and the whole pipeline runs again to pick the next token. And again. One token at a time, looping, until it produces a special "I'm done" token. That streaming you see in a chat window? That's this loop, live.

The punchline you should already feel: the model never "counts the r's." It pattern-matches its way to an answer, token by token, having never seen the individual letters in "strawberry" at all. That's not a bug in one model — it's a direct consequence of Stage 1. By the end of Section A you'll understand exactly why, and you'll never be fooled by a "the AI can't spell" headline again.

Now let's earn that understanding. We'll walk the same pipeline again — slowly, properly — as the life story of a model: how one is born, and then how it's used.

Section A  The lifecycle of a modern LLM

Born, shaped, then used

Here's the spine of this whole section, the story arc we're about to tell. A model isn't programmed. It's grown, then shaped, then used. Eight stages, one continuous story. Everything up to the final stage happens once, in a data center, over months — the BIRTH of the model. The last stage, inference, happens every single time you send a message — the model's LIFE. Let's go.

A1  Tokenization

Teaching the model an alphabet of its own

A model can't see text the way you do. Before anything else, we have to convert writing into numbers, and the very first decision is: what is the smallest unit the model is allowed to see?

The naive answer is "letters." It fails: spelling out everything letter by letter makes sequences impossibly long and throws away the obvious fact that "running" and "runner" share a root. The other naive answer is "whole words." That fails too: there are millions of words, names, and typos, and the model would be helpless the first time it met a word it had never seen.

The field's answer is a beautiful compromise called subword tokenization, most commonly a scheme named Byte-Pair Encoding (BPE) [1]. The idea: start with small units, then repeatedly glue together the pairs that show up most often, until you've built a vocabulary of common chunks. Frequent words ("the", "strawberry") become single tokens; rare words get assembled from a few pieces ("tokenization" → "token" + "ization"). Modern models run BPE not on characters but on raw bytes (byte-level BPE, introduced with GPT-2 [2]), which is what guarantees that anything the model meets, even a word it's never seen, an emoji, or a stray symbol, can always be spelled out from smaller fragments as a last resort. Nothing is ever un-representable.
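The "glue the most frequent pair" loop described above can be sketched in a few lines. This is a toy, character-level version under stated assumptions: the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the tiny corpus are made up for illustration, and real tokenizers work on bytes with far larger corpora and vocabularies.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Glue every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merge rules from a list of words (toy BPE)."""
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word appears.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair glued together.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            rewritten[merge_pair(symbols, best)] += freq
        corpus = rewritten
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus, chosen only to make the idea visible.
corpus = ["straw", "strawberry", "berry", "berry", "blueberry", "blue"]
merges = learn_bpe_merges(corpus, num_merges=8)
print(tokenize("strawberry", merges))
print(tokenize("blackberry", merges))  # an unseen word still gets spelled out
```

Notice that an unseen word never fails: in the worst case it simply falls back to single characters, which is the toy analogue of the byte-level safety net.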

See it as the model sees it.

Type anything. Watch words land as a chunk or two — not as letters. That gap is why it miscounts r's.


Token splits are precomputed and illustrative; real tokenizers vary by model. IDs shown for "strawberry": straw = 15140, berry = 19772.

This one design choice has consequences that ripple through everything:

So: we've turned writing into a stream of token-IDs. But an ID is just a name tag — the number "5176" tells the model nothing about what "strawberry" means. That's the next problem.

A2  Embeddings

Giving every token a place in meaning-space

A token-ID is arbitrary. We need to convert each one into something that actually carries meaning. The trick: represent every token as a long list of numbers — a vector — that you can think of as coordinates in a high-dimensional space of meaning. (High-dimensional just means "lots of coordinates" — hundreds or thousands per token, instead of the three we live in. Don't try to picture it literally; picturing 3-D and trusting the math is enough.)

12,288
coordinates in a single GPT-3 token vector. We flatten it to 2 below — the intuition survives.

The magic property: the model learns these coordinates so that tokens with similar meaning land near each other. "King" sits near "queen." "Paris" sits near "France." This learned vector is called an embedding. Directions in the space can even encode relationships, as in the famous result that king − man + woman ≈ queen [3]. (One honest caveat: that clean piece of arithmetic comes from an earlier, static kind of word embedding, word2vec from 2013, where each word has one fixed vector. The token embeddings inside an LLM use the same near-means-similar idea, but they don't stay fixed: the very next stage adjusts each one based on context. So treat king − man + woman as the intuition pump it is, not a literal operation happening inside GPT.)

Words become coordinates.

Related words land together. Watch the arithmetic: take king, apply the same step that turns man into woman, and you arrive at queen. The offset itself carries the meaning.

Illustrative; real embeddings have thousands of dimensions, flattened here to two. The king−man+woman result comes from static word2vec embeddings, not from inside an LLM.

Why does this matter so much? Because once meaning is geometry, reasoning starts to look like arithmetic the machine can actually do. The model isn't shuffling words; it's moving points around in a space where "closer" means "more related." Every later stage operates on these vectors, never on the text.
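To make "meaning is geometry" concrete, here is a small sketch with hand-picked 2-D vectors. Every number in it is invented for illustration (real embeddings are learned and have thousands of coordinates), and the helper names `cosine` and `nearest` are our own. It only shows how "closeness" and the king − man + woman step can be computed once words are points.

```python
import math

# Hypothetical toy vectors, NOT learned embeddings.
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
vectors = {
    "king":  [0.9, 0.3],
    "queen": [0.9, -0.3],
    "man":   [0.1, 0.3],
    "woman": [0.1, -0.3],
    "apple": [-0.7, 0.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude=()):
    """Return the word whose vector points most nearly the same way as target."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


# Take "king", subtract "man", add "woman", one coordinate at a time.
analogy = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

print(nearest(analogy, exclude=("king", "man", "woman")))
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```

The input words are excluded from the search, as is common when evaluating analogies, so the question becomes "which other word is closest to where the arithmetic landed?"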

But there's a gap. Right now each token's vector is fixed — "bank" has one location, whether you mean a riverbank or a savings bank. Meaning in real language depends on context. We need a mechanism that lets each token adjust itself based on its neighbors. That mechanism is the heart of the whole revolution.

A3 & A4  The Transformer and Attention

Letting words read the room

This is the engine. It's worth slowing down, because if you understand this one idea, you understand why the last decade happened.

The problem the field was stuck on. Before 2017, the leading approach read text the way you'd read through a straw — one word at a time, left to right, trying to cram everything it had seen so far into a single running "memory." (These were called RNNs, recurrent neural networks — recurrent meaning they looped over the sequence step by step.) Two fatal flaws: they forgot the beginning of long passages by the time they reached the end, and because each step depended on the one before it, they couldn't be sped up by doing the work in parallel. Training was slow, and long-range understanding was poor.

The 2017 breakthrough, a paper bluntly titled Attention Is All You Need [4], threw out the straw entirely. Its architecture, the Transformer, lets the model look at many words at once and, for each word, decide which other words it should pay attention to. That's it. That's the idea. It's called self-attention: a token gets to ask other tokens "how relevant are you to me, right now?" and weight them accordingly. One crucial detail for the chat models you actually use: they read left-to-right and look only backward. Each token can attend to the words that came before it, never the ones still to come. (It has to be this way: when the model is predicting the next word, the future words don't exist yet.) This is called causal (or masked) attention.

Here's the analogy that makes it click. Picture a dinner-party conversation. When someone says "it," they mentally check back over what's already been said: what does "it" refer to? — and the word "it" effectively turns up the volume on the noun it points back to, and turns down the irrelevant chatter. Each word builds its understanding by selectively listening to the room. Where the metaphor stops: in a chat model, a guest can only hear the people who spoke before them — nobody hears the future. That's the causal rule above, and it's why the model can generate one word at a time at all.

Hover a word. Watch it look backward.

Earlier words light up by how hard the hovered word attends to them. Forward words grey out — the model never peeks at the future.


Attention weights are precomputed and illustrative. Causal rule enforced: a word can only attend to words before it.
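For readers who want to see the causal rule as code, here is a minimal single-head sketch. It makes one big simplifying assumption: each token's vector is used directly as its query, key, and value, whereas a real Transformer first passes the vector through learned projections. The function names are ours, and the input vectors are invented.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """Each position blends itself with EARLIER positions only."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: positions 0..i, never the future
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        blended = [
            sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)
        ]
        outputs.append(blended)
        # Pad with zeros so every row has one weight per token in the sentence.
        all_weights.append(weights + [0.0] * (len(vectors) - i - 1))
    return outputs, all_weights


# Three made-up token vectors.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(toy_tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one token's attention pattern. The zeros to the right of the diagonal are the "greyed out" future words from the widget above.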

Two design notes that pay off later:

So now we have an architecture: a tall stack of attention layers that turn a sequence of token-vectors into a rich, context-aware understanding, and finally into a prediction of the next token. But an architecture is an empty engine. It knows nothing yet. We have to fill it with knowledge. That's training, and it comes in three escalating acts.

A5  Pre-training

Reading the internet to learn the world

This is where a model gets its raw intelligence, and it's astonishingly simple to state: predict the next token, over and over, across a huge slice of human writing.

That's the entire objective. Show the model "The capital of France is ___" and have it guess; if it guesses wrong, nudge all those billions of internal numbers (called parameters — the knobs the model learns) a hair in the direction that would've been right. Do this trillions of times, over books, code, websites, and forums, and something remarkable happens: to get good at predicting text, the model is forced to learn the patterns behind the text — grammar, facts, a little arithmetic, the structure of an argument, the rhythm of a story. Understanding is a side effect of relentless autocomplete.

The pre-training loop.

No human grades anything. The internet IS the answer key — the next word is always sitting right there in the text.

1. Read text. A snippet from the internet: books, code, web.
2. Predict next token. Guess the word that comes next.
3. Check the answer. The real next word is right there; the internet IS the answer key.
4. Nudge the weights. Adjust slightly so the guess gets better. Repeat.
↺ × 1,000,000,000,000

Schematic of the self-supervised next-token objective. Input funnel: books + code + web + conversations.
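The four-step loop above can be sketched numerically. This is a deliberately tiny illustration, assuming a three-word vocabulary and nudging the output scores directly; in a real model the nudge is applied to billions of parameters via backpropagation, and the names `next_token_loss` and `nudge` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits, target):
    """Cross-entropy: small when the true next token got high probability."""
    return -math.log(softmax(logits)[target])


def nudge(logits, target, learning_rate=0.5):
    """One gradient-descent step on the scores.

    For softmax plus cross-entropy, the gradient with respect to each
    score is (predicted probability - 1 if it is the true token else 0).
    """
    probs = softmax(logits)
    return [
        z - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# "The capital of France is ___" with a made-up three-word vocabulary.
vocab = ["Paris", "London", "banana"]
logits = [0.0, 0.0, 0.0]  # an untrained model: no preference at all
target = vocab.index("Paris")

before = next_token_loss(logits, target)
logits = nudge(logits, target)
after = next_token_loss(logits, target)
print(before, after)
```

The loss drops after a single nudge because the score for the true next word went up and the others went down. Pre-training is this step repeated at enormous scale.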

$100M+
estimated cost of a single frontier pre-training run — months on tens of thousands of GPUs.

This stage is why models are so expensive and why only a handful of players do it: it eats months of time on tens of thousands of GPUs at an estimated cost of tens to hundreds of millions of dollars for a frontier run [12]. And it raises the central economic question of the field: given a fixed pile of money and compute, should you build a bigger model or feed it more data?

For a while everyone chased size: bigger model, bigger headlines. Then in 2022 a paper nicknamed Chinchilla showed the field had been doing it wrong: most big models were undertrained (too many parameters, too little data), and you'd get a smarter model for the same cost by making it smaller but feeding it far more text [5]. The takeaway, now lore: data and model size must scale together. A model isn't "better" because it's bigger; it's better when its size and its training data are balanced for the compute you spent.

Bigger isn't smarter. Balanced is smarter.

For a fixed compute budget, error bottoms out where parameters and data are balanced — this chart quietly reset how every lab budgets a run.

After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations.
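If you want one number to carry out of the basement, here is a back-of-the-envelope sketch of what "balanced" means. It leans on two widely quoted rules of thumb that are approximations, not exact laws: training cost is roughly 6 × parameters × tokens (in floating-point operations), and the compute-optimal ratio is roughly 20 training tokens per parameter. The constant and function names are ours.

```python
import math

# Rules of thumb (assumptions, not exact laws):
FLOPS_PER_PARAM_PER_TOKEN = 6  # training compute ~ 6 * params * tokens
TOKENS_PER_PARAM = 20          # roughly compute-optimal data-to-size ratio


def balanced_split(compute_flops):
    """Split a compute budget into a balanced model size and dataset size.

    Solves compute = 6 * params * tokens with tokens = 20 * params.
    """
    params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = balanced_split(5.88e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Plugging in about 5.88e23 operations gives roughly 70 billion parameters and 1.4 trillion tokens, which is the configuration reported for Chinchilla itself. Note the square root: under these assumptions, four times the compute buys a model only twice as large, trained on twice as much data.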

Why this matters for evaluation. When a startup brags about "a trillion-parameter model," the right question isn't "how big?" — it's "how much did you train it, and on what?" Parameter count alone is a vanity metric. Data quality and quantity are where models are actually won or lost.

At the end of pre-training you have a base model: a sprawling, knowledgeable, deeply weird text-predictor. It is not a helpful assistant. Ask it a question and it might continue with five more questions, because on the internet, questions are often followed by more questions. It has knowledge but no manners, no sense that it's supposed to help you. Fixing that is the next two acts.

A6 & A7  Fine-tuning and post-training

Turning a know-it-all into an assistant

A base model is a brilliant, feral library that talks like the average of the internet. Post-training is the finishing school that turns it into the polite, helpful "assistant" you actually chat with. It's where a model gets its personality and its alignment — and it happens in steps.

Step one: Supervised fine-tuning (SFT) — show, don't tell. We collect a pile of high-quality example conversations — a human writes an ideal answer to a prompt — and we fine-tune the base model on them. Fine-tuning just means more training, but now on a small, curated set instead of the raw internet. The model learns the format of being helpful: when you ask a question, you answer it; you don't ramble; you follow instructions. This is imitation — the model copies good examples.

Step two: learning from preferences — rank, don't script. Imitation has a ceiling: humans can't hand-write an ideal answer to every possible prompt, and "good" is often a matter of taste and degree. So we switch from showing to judging. We have the model produce two answers, and a human (or another model) says "this one's better." Do this across mountains of comparisons, and you can teach the model to produce answers humans prefer — more helpful, more honest, less likely to confidently make things up.

The landmark here is InstructGPT / RLHF (Reinforcement Learning from Human Feedback) [6]. The recipe: use all those human preference judgments to train a reward model (a model that scores how good an answer is), then use reinforcement learning to push the assistant toward higher-scoring answers. RLHF is the single biggest reason ChatGPT felt like a leap over raw GPT-3: same underlying knowledge, radically better behavior. (The full machinery of how RL actually works is the hard part; that's exactly what the RL section below is for.)
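How do "this one's better" judgments become a training signal for the reward model? A common formulation, and the one described in the InstructGPT paper, is a pairwise loss: the reward model is penalized whenever it scores the rejected answer close to, or above, the preferred one. The sketch below shows only that loss on made-up scores; the function name is ours, and a real reward model is a neural network producing these scores.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).

    Near zero when the preferred answer scores much higher;
    large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # same value as -log(sigmoid(margin))


# Hypothetical reward-model scores for two answers to the same prompt.
print(preference_loss(2.0, 0.0))  # agrees with the human: small loss
print(preference_loss(0.0, 0.0))  # no opinion: medium loss
print(preference_loss(0.0, 2.0))  # disagrees with the human: large loss
```

Only the difference between the two scores matters, which fits the data: humans said which answer was better, not how good each one was in absolute terms.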

From feral library to assistant.

Same brain the whole time — we're not adding knowledge, we're shaping behavior. Helpfulness rises left to right.

PANEL 1 · Base model: Spouts raw internet text; knowledgeable, but unhelpful and unsteerable.
PANEL 2 · + SFT · Imitation: Answers in a clean assistant format; learns how to respond by copying good examples.
PANEL 3 · + RLHF · Judgment: Learns which responses humans actually prefer; the step where it grows up.

Schematic escalation. Same model throughout — each stage shapes behavior, not knowledge.

A few things worth internalizing, because they're where evaluation gets sharp:

The model is now born and raised: knowledgeable from pre-training, helpful from post-training. It sits frozen, finished, weighing in at billions of parameters. Now — finally — someone sends it a message. That's the last stage, and it's the only one that happens every single time you hit enter.

A8  Inference

The model, in use, one token at a time

Inference is the model running — taking your prompt and generating a reply. This is the loop from the on-ramp, and now you have the full picture of what's happening inside each step. Your message gets tokenized, embedded, and pushed up through the whole stack of attention layers, which produces a ranked list of likely next tokens. One is chosen. It's appended to the conversation. The whole thing runs again for the next token. And again. Word by word, which is exactly why replies stream onto your screen rather than appearing all at once.

Generation is one frozen loop.

No learning happens here — the parameters are locked. Rank the next token, pick one, append, repeat.

1. Prompt → tokens. Text is split into tokens.
2. Tokens → vectors. Each token becomes a list of numbers.
3. Up the attention stack. Layers let tokens read each other for context.
4. Rank next tokens. Out comes a ranked list of likely next tokens.
5. 🔒 Pick one, append. Choose one, add it, run the whole thing again.
↺ next token  ·  this happens once per token, ~dozens of times per sentence

🔒 Parameters are frozen — no learning happens here; it's pure read-out.
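The loop itself is short enough to write down. In this sketch the "model" is a hypothetical lookup table standing in for the whole frozen network, and it looks only at the last token, whereas a real model conditions on the entire conversation so far. The token names, probabilities, and the `<end>` marker are all invented for illustration.

```python
# Hypothetical stand-in for a frozen model: last token -> ranked candidates.
TOY_MODEL = {
    "?": [("There", 0.6), ("The", 0.4)],
    "There": [("are", 0.9), ("is", 0.1)],
    "are": [("three", 0.7), ("two", 0.3)],
    "three": [("<end>", 1.0)],
}


def rank_next_tokens(tokens):
    """Return (token, probability) candidates for the next position."""
    return TOY_MODEL.get(tokens[-1], [("<end>", 1.0)])


def generate(prompt_tokens, max_new_tokens=10):
    """Rank, pick, append, repeat, until the 'I'm done' token appears."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        ranked = rank_next_tokens(tokens)
        # Greedy choice: always take the single most likely candidate.
        next_token = max(ranked, key=lambda pair: pair[1])[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the input
    return tokens


print(generate(["How", "many", "r's", "?"]))
```

Nothing inside `TOY_MODEL` changes while `generate` runs, which is the code-level meaning of "parameters are frozen": the loop only reads from the model, it never writes to it.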

Birth vs. life. Everything before inference — months, once, in a data center. Inference itself — a fraction of a second, every message, for every user on Earth.

Two things make inference the part of the lifecycle that businesses obsess over:

Temperature = creativity dial

Same probabilities, different boldness. Temperature is the user's dial between safe and creative.

Fixed prompt: "The weather today is ___"  ·  Temperature: 0.8

Illustrative distribution; real vocabularies are ~100k tokens. Softmax-with-temperature, precomputed.
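The dial in the widget corresponds to a one-line change in how scores become probabilities: divide every score by the temperature before applying softmax. Here is a minimal sketch; the candidate words and scores are invented, and the function names are ours. It assumes a temperature above zero (systems that accept 0 typically treat it as "always pick the top token").

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens, logits, temperature, rng=random):
    """Pick one token at random, weighted by the temperature-adjusted odds."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up candidates for "The weather today is ___".
tokens = ["sunny", "cloudy", "apocalyptic"]
logits = [2.0, 1.0, 0.1]

for temperature in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

rng = random.Random(0)  # fixed seed so the demo is repeatable
print([sample(tokens, logits, 0.8, rng) for _ in range(5)])
```

The underlying scores never change. Temperature only changes how much of the probability mass the top candidate keeps, which is why the same prompt can feel "safe" or "creative" on the same model.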

That's the full lifecycle: an empty architecture, filled with world-knowledge by pre-training, taught manners by post-training, and finally run, token by token, every time you ask it something. One story, eight stages, start to finish.

But that story took one thing for granted at every step: the architecture underneath it. We kept saying "the model" as if its shape were obvious — yet a decade ago almost nobody would have built it this way. So before we go any deeper, it's worth asking the question that quietly explains the whole modern era: why this design, and not one of the dozens that came before it? The answer isn't really about language. It's about the machines we feed.

Section B  Why Transformers won

The idea that fit the machine

We just watched attention work: every word turns up the volume on the words that matter to it and tunes out the rest. That's the what. But a clever idea isn't enough to flip an entire industry — plenty of clever ideas die in a drawer. So the real question, the one that explains why the 2020s look the way they do, is: why did this architecture beat everything that came before, so completely that the whole field abandoned the old way almost overnight?

The short answer is going to surprise you. Transformers didn't just win because they understood language better. They won because they were shaped exactly right to be fed by the machines we happen to build — and that single fact is the thread that ties the math of the model to the dollar cost of running it. Let's earn that.

B1  The old way — reading through a straw

One word at a time, memory smearing as you go

To feel why attention was a breakthrough, you have to feel the pain it cured. So rewind to before 2017.

The leading approach to language back then was the RNN — a recurrent neural network, where "recurrent" just means it loops over the text one piece at a time. (You met this briefly in Section A; now we open it up.) Picture reading a sentence through a straw: you can see exactly one word, you read it, you update a little running summary in your head — a single mental "state" meant to hold everything important so far — then you slide the straw to the next word and repeat. The model never sees the whole sentence laid out; it sees a parade of single words and a memory it keeps rewriting.

The most famous version, the LSTM (Long Short-Term Memory), was a genuinely brilliant patch on this idea: it added little gates that decided what to keep in memory and what to forget, so the running summary wouldn't get instantly overwritten [15]. For years, LSTMs and their variants were the leading approach to language. (There were also CNNs, convolutional networks borrowed from image processing, used on text by sliding a small window across the words; faster than RNNs, but they only ever looked through a fixed-size window, so distant words still couldn't easily talk to each other.) The straw got better. It was still a straw.

And the straw had two flaws that no amount of cleverness could fully fix:

Flaw one — the memory fades. Everything the model knows about the sentence has to be squeezed, at every step, into that one running summary. By the time it reaches the end of a long paragraph, the beginning has been overwritten dozens of times — diluted, smeared, half-forgotten. This is the long-range dependency problem: in "The strawberry that I picked from the garden behind my grandmother's old house last summer was ripe," the word "was" needs to connect back to "strawberry," but fifteen words of memory-rewriting sit in between. The signal has to survive a game of telephone. Often it doesn't.

Flaw two — and this is the one that actually decided the war — you cannot do the work in parallel. Because each step depends on the running summary produced by the step before it, the model must process word 1, then word 2, then word 3, strictly in order. Word 50 cannot be computed until words 1 through 49 are done. There's no skipping ahead, no splitting the labor.

Reading through a straw

Each step has to wait for the one before it — and the earliest words fade as the running summary is rewritten over and over.

The → state → 🔒 wait → straw· → state → 🔒 wait → berry → state → 🔒 wait → … last → state → 🔒 wait → was

Faded boxes = words already smeared into the running summary. The 🔒 between every step is the bottleneck: step N cannot start until step N−1 finishes.

Schematic of the recurrent left-to-right chain. Fading is illustrative of the long-range dependency problem.
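Both flaws are visible in even the smallest recurrent sketch. The version below shrinks the running summary to a single number and uses two made-up weights, so it is a caricature of an RNN rather than a working language model, but the structure is the real one: each step needs the state from the step before.

```python
import math


def rnn_read(inputs, w_state=0.5, w_input=1.0):
    """Read inputs one at a time, rewriting one running summary as we go.

    w_state and w_input are hypothetical fixed weights; a real RNN
    learns matrices here and keeps a whole vector of state.
    """
    state = 0.0
    history = []
    for x in inputs:
        # Flaw two: this line cannot run for step N until step N-1 is done.
        state = math.tanh(w_state * state + w_input * x)
        history.append(state)
    return history


# A strong signal on the first word, then ten words of filler.
history = rnn_read([1.0] + [0.0] * 10)
print([round(h, 4) for h in history])
```

The printed trace shrinks toward zero at every step: that is flaw one, the first word being smeared out of the summary. And because `state` feeds into its own next value, there is no way to hand different steps to different workers.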

Hold onto that second flaw. It sounds like a mere engineering annoyance — so what if it's a little slow? But it's the hinge the whole story turns on, and here's why: the thing that makes modern AI work is scale — throwing enormous amounts of computation at enormous amounts of data. An architecture that forces you to do everything in single file can never absorb that much computation, no matter how much you're willing to spend. The straw had a speed limit baked into its shape.

B2  Self-attention's escape — every word talks to every word, at once

Throwing away the loop

Now the move that broke it open. The 2017 paper Attention Is All You Need [4] did something that, in hindsight, looks almost reckless: it threw away the recurrence entirely. No more loop. No more running summary passed hand-to-hand down the line. No straw.

Instead — self-attention, the mechanism from Section A: every token looks at every other token directly, in a single step, and decides how much each one matters to it. (Recall the dinner-party guest turning up the volume on the words that are relevant to "it.") And here's the part that matters for this section, the part to really sit with:

Distance stops mattering. In the straw, connecting "strawberry" to "was" fifteen words later meant the signal had to survive fifteen rounds of memory-rewriting. In attention, "was" looks straight back at "strawberry" in one hop — the same single step it uses to look at its immediate neighbor. A word a hundred tokens away and a word right next door are exactly the same distance to attention: one direct link. The game of telephone is gone. The fading-memory problem isn't patched — it's structurally deleted.

Same sentence, two architectures

The RNN walks the hallway one door at a time. Attention is in the room with everyone at once — and a far word is no harder to reach than a near one.

RNN (sequential): long-range link = survive every step in between. Tokens shown: straw·, that, I, summer, was. Time: 6 steps, in order.

Self-attention (parallel): long-range link = ONE direct hop, same as a near one. Time: all links at once.

Illustrative. The bold clay link marks the one long-range pair to watch; the faint web is the full all-to-all pattern.

That alone would make attention better at understanding. But better-at-understanding is not what wins an industry. The thing that won the industry is hiding in the words "in a single step."

B3  Parallelism — the unlock that made scale possible

A shape the hardware was starving to run

Here is the quiet revolution, and it's worth stating as plainly as possible because almost everything downstream depends on it.

Because attention looks at all the words at once instead of marching through them one-after-another, there is no longer a step that has to wait for the step before it. Word 50's attention can be computed at the very same instant as word 1's. The strict single-file ordering — the thing that throttled RNNs — is gone. The work can be spread out and done simultaneously.

Why does that change everything? Because of the hardware. A GPU (graphics processing unit — the chip originally built to draw video-game frames) is, at its core, a machine for doing thousands of small calculations at the same time. It is gloriously, ridiculously parallel. Feed a GPU a task that must be done in strict order, and most of those thousands of little workers sit idle, twiddling their thumbs, waiting their turn — which is exactly what an RNN does to a GPU. Feed it a task where everything can happen at once, and every worker lights up together.

Self-attention is the second kind of task. The Transformer didn't just understand language in a new way — it understood it in a shape the hardware was already starving to run.

The same chip, asleep or awake

The only difference is whether the math lets you use the whole chip at once. This is the whole ballgame.

RNN on a GPU (throughput: ~idle). The chip is mostly asleep; every core waits its turn while one bright worker steps across the grid.

Transformer on a GPU (throughput: maxed). Every core busy at the same time; the all-at-once math lights the entire grid.

Schematic of GPU core utilization. One clay core marks the lone active worker the RNN can keep busy.

This is the unlock. Once training could run in parallel across the whole sequence, you could throw vastly more computation at the problem in the same wall-clock time — which meant you could train on vastly more text, with vastly bigger models. And the central lesson of the modern era, the one we'll keep returning to, is brutally simple:

The architecture that could eat the most compute won. It wasn't necessarily the cleverest design in some abstract sense; it was the one that turned "spend more money on chips" directly into "get a smarter model." RNNs choked on scale; Transformers feasted on it. And the timing was perfect: scaling-laws research was just then showing that model capability climbs predictably as you pour in more compute [5], so the architecture that could actually absorb that compute, by running in parallel [4], was the one positioned to win. Once that became clear, the entire field pivoted, and it pivoted fast.

But I've been hand-waving with the word "compute." What, exactly, is the GPU doing thousands of times at once? Answer that, and the bridge from the model's math to its dollar cost falls right into place.

B4  Matmuls — the bridge between the math and the money

A tower of giant multiplication tables

Here is the single most useful thing a non-engineer can understand about how these models actually run, and it fits in one sentence: underneath all the talk of attention and layers, a Transformer is, almost entirely, a tower of giant multiplication tables.

The technical name is matrix multiplication, "matmul" for short. A matrix is just a grid of numbers; multiplying two of them means doing a huge batch of "multiply these, add them up" operations to produce a new grid. You don't need the procedure. You need this: attention is matmuls, and the feed-forward layers between attention steps are matmuls. When the model decides how much "was" should attend to "strawberry," that's a matmul. When it pushes each word's vector through a layer to refine its meaning, that's a matmul. Stack the layers of a large model (on the order of a hundred in published designs), and running it once is billions upon billions of these multiply-and-add operations, and almost nothing else.
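For the curious, here is what "multiply these, add them up" looks like when written out, plus a way to count the work. This is plain Python for readability, not how a GPU does it; the helper names and the example sizes at the bottom (1,000 tokens, 4,096 coordinates) are assumptions picked only to show the scale.

```python
def matmul(a, b):
    """Multiply two grids of numbers using nothing but multiply-and-add."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def transpose(m):
    """Flip a grid so rows become columns."""
    return [list(column) for column in zip(*m)]


def multiply_adds(rows, inner, cols):
    """Number of multiply-and-add operations in one (rows x inner) @ (inner x cols)."""
    return rows * inner * cols


# Attention scores for three made-up token vectors: every token against every token.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = matmul(token_vectors, transpose(token_vectors))
print(scores)

# Hypothetical sizes: 1,000 tokens, 4,096 coordinates each.
print(multiply_adds(1000, 4096, 1000))
```

Every entry of the output grid is independent of every other entry, so thousands of them can be computed at the same moment. That independence is exactly what a GPU is built to exploit.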

Now watch the whole thread snap together, because this one fact is load-bearing for half the rest of this document:

Matmuls are the bridge: MATH → HARDWARE → COST

The same fact that makes Transformers scale beautifully is the fact that makes running them cost what it does. Remember this chain — it explains both the magic and the bill.

MATH → (perfect fit) → HARDWARE → (every token) → COST

Schematic. The grids and cost bars are illustrative; the chain — and its two-way reading — is the point.

Sit with that, because it's the rare idea that pays off in two directions at once. The matmul is simultaneously why the technology works and why it's expensive — the bridge between the math on the whiteboard and the line item on the invoice. Most people understand one side or the other. You now hold both ends of the same thread.

B5  Hardware as the hidden hand — progress is gated by what chips do well

The silicon quietly selects the winner

Step back from Transformers specifically, because there's a bigger lesson here that will make you sharper about every future "breakthrough" claim you hear.

The instinctive story of progress is: someone has a brilliant idea, and the idea wins because it's brilliant. The truer story, especially in AI, is messier and more interesting: an idea wins when it's brilliant and it happens to fit the hardware we can build cheaply at scale. Progress is gated as much by what silicon does efficiently as by what's clever on paper.

Transformers are the cleanest example. The attention mechanism wasn't conjured from nothing in 2017; pieces of it existed earlier. What changed is that someone built an architecture that was all attention and all matmul, with the recurrence stripped out, and that turned out to be the shape that let the GPUs we were already mass-producing run flat-out. An equally clever architecture that didn't fit the hardware (one that demanded, say, lots of strict step-by-step ordering, or some operation GPUs are bad at) would most likely have lost, no matter how elegant. We can't know for certain, because the hardware-friendly idea is the one that got to eat all the compute and therefore got all the investment, all the engineering, all the refinement. The hardware doesn't just run the winning idea. It quietly selects which idea gets to win.

Why this matters for evaluation. When a founder pitches "a fundamentally new architecture that beats Transformers," the sharp follow-up isn't "is it clever?" — it's "does it map onto the hardware people actually own?" A design that's smarter on paper but fights the GPU (or can't ride the same massive supply chain of chips) starts the race with a boulder on its back. Many promising "Transformer killers" have stalled for exactly this reason: not because the math was wrong, but because the silicon wasn't on their side. Cleverness is necessary. Hardware-fit is what's decisive.

This is also the lens for understanding why so much frontier effort goes into co-design — tweaking the architecture and the chips toward each other. FlashAttention, for instance, didn't change what attention computes at all; it just reorganized how the computation moves data around inside the GPU's memory so the chip stops waiting around — and that single hardware-aware rewrite made attention dramatically faster and cheaper, with identical results.16 The lesson repeats at every level: in this field, knowing the hardware is knowing the algorithm.
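The "same answer, different data movement" idea can be felt in miniature. The sketch below is not FlashAttention itself. It is a toy, single-query illustration, with made-up scores and hypothetical function names, of one ingredient that hardware-aware attention kernels rely on: visiting the keys in small blocks while keeping running totals, so the full row of scores never has to sit in memory at once.

```python
import math


def attention_all_at_once(scores, values):
    """Softmax-weighted average of values, computed over the full row."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_in_blocks(scores, values, block_size=2):
    """The same quantity, visiting the row one small block at a time."""
    running_max = float("-inf")
    running_total = 0.0   # sum of exp(score - running_max) seen so far
    running_output = 0.0  # weighted sum of values seen so far
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale the earlier totals to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        running_total *= rescale
        running_output *= rescale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_total += w
            running_output += w * v
        running_max = new_max
    return running_output / running_total


scores = [0.5, 2.0, -1.0, 3.0, 0.0]   # illustrative attention scores
values = [1.0, 2.0, 3.0, 4.0, 5.0]    # illustrative values, one per token
full = attention_all_at_once(scores, values)
blocked = attention_in_blocks(scores, values)
print(math.isclose(full, blocked, rel_tol=1e-9))  # True
```

The two functions agree up to floating-point rounding. What changes is how much has to be held at once, which is the kind of difference a memory-bound chip cares about.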

B6  The catch — attention's cost grows with the square of the input

Double the text, quadruple the work

Everything above is the triumph. Now the limitation, because no honest account of why Transformers won can skip the price tag they carry — and this particular catch is the seed of half the open problems in the field.

Go back to the thing that made attention magical: every token looks at every other token. That's a wonderful property for understanding. It's a punishing property for cost, and the reason is just counting. If you have 10 tokens and each must look at all the others, that's roughly 10 × 10 = 100 little comparisons. Fine. But double the input to 20 tokens, and it's 20 × 20 = 400 — you doubled the text but quadrupled the work. Go to 100 tokens and it's 10,000 comparisons; 1,000 tokens, a million. The work grows with the square of the length, not in step with it. Engineers call this O(n²) — "order n-squared" — and it just means: as the input gets longer, the cost balloons far faster than the input itself.
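The counting argument above fits in a few lines of Python. This is only a sketch of the arithmetic; the function name `pair_comparisons` is made up for illustration.

```python
def pair_comparisons(num_tokens):
    """Every token looks at every token: n * n comparisons."""
    return num_tokens * num_tokens


for n in (10, 20, 100, 1000):
    print(f"{n:>5} tokens -> {pair_comparisons(n):>9,} comparisons")

# Doubling the length multiplies the work by four, whatever the starting point.
growth = pair_comparisons(2000) / pair_comparisons(1000)
print(growth)  # 4.0
```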

The N×N square — root of the context window

Step the input up and watch the grid of token-pair comparisons explode. This single square is one of the hottest research frontiers in the field.

N=4 → 16 cells
N=16 → 256 cells
Sequence length N: 4 tokens
At N=4 the grid is a tiny 16 cells. Drag right: the text grows in a line, the work grows in a square.

Illustrative. Cost curve is n² normalized against a linear reference; the clay marker tracks the slider's N.

This one fact has consequences you've felt without knowing the cause:

Why this matters for evaluation, and where it points next. The O(n²) wall is one of the most active research frontiers in all of AI — entire approaches (linear-attention variants, state-space models, sparse and sliding-window attention, and more) exist primarily to dodge it. So when a startup claims a "1-million-token context" or a "Transformer-killer that scales linearly," you now know the real question: what did they give up to beat the square? Sometimes the answer is a genuine, clever win. Often it's an approximation that quietly degrades the very long-range understanding that made attention worth having in the first place. The quadratic cost is the tension every long-context and next-architecture claim is wrestling with — and it's exactly the thread we pick up in the sections on context windows, inference cost, and what might come after the Transformer.

So that's why Transformers won: they deleted the fading-memory problem with direct all-to-all attention, they broke the single-file bottleneck so training could finally run in parallel, and — the decisive part — that parallel, matmul-shaped work fit the GPUs we could mass-produce, turning money straight into intelligence. They won the way most technologies actually win: not purely on elegance, but on the marriage of a good idea to the machine that could run it. And the same all-to-all attention that made them brilliant is the n² catch that now defines their frontier. Keep that square in your back pocket — it's about to explain a lot.

State of play — fenced off because it dates. The mechanisms above will not.

STATE OF PLAY — June 2026
· The Transformer still anchors essentially every frontier model — GPT-5-series,
  Claude Opus 4.6/4.7, Gemini 3.1 Pro, DeepSeek (V3.2/V4). No challenger
  architecture has displaced it at the top, though hybrids are creeping in.
· The O(n²) attention cost remains the central scaling tension. Production
  long-context models lean on FlashAttention-style hardware-aware kernels plus
  sparse / sliding-window / linear-attention tricks and retrieval to fake very
  long context affordably — not pure all-to-all attention at length.
· State-space models (the Mamba line) and other sub-quadratic designs are the
  most-watched "after the Transformer" candidates, increasingly shipped as
  HYBRIDS (a few attention layers + many cheap layers) rather than full
  replacements. The reason is the lesson in B5: hardware-fit decides, and the
  GPU ecosystem is built around matmul-heavy attention.
Specific models/numbers will age fast; the mechanisms above will not.

Section C  Pre-training

Back in Section A we walked the whole lifecycle and gave each stage a single sentence. Pre-training got this one: "reading the internet to learn the world." We said it's where a model gets its raw intelligence, that it's the most expensive stage, and that it ends with a knowledgeable-but-feral base model that isn't yet an assistant. We even met Chinchilla in passing.

This is the zoom-in. By the end of it you'll understand exactly what the model is doing during those months in the data center, why it produces real knowledge out of a game that sounds almost too dumb to work, and where the whole approach hits a wall the field is openly worried about. Keep the lifecycle diagram from Section A in your head — we're standing on the box labeled Pre-training, and everything here is what happens inside it before the model ever gets its manners.

You are here.

Pre-tokenize
Pre-training
Post-training
Reasoning / RL
Serving

We zoom into the one box that takes months, tens of thousands of GPUs, and most of the money — and produces a brilliant, useless genius.

C1  One stupidly simple game, played a trillion times

Here is the entire objective of pre-training, with nothing left out:

Show the model some real text, with the next chunk hidden. Let it guess the hidden chunk. Check the guess against the truth. Nudge the model's internal numbers a hair toward what would have been right. Repeat — across roughly the whole readable internet.

That's it. That's the game. It's called next-token prediction, and it is the only thing happening in pre-training. (Recall from the on-ramp that a token is a chunk of text — usually a word or word-fragment — and that the model, at heart, is "the world's most sophisticated autocomplete." Pre-training is where that autocomplete is built.)
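Here is one way to picture the "guess, check, nudge" loop in code. It is a toy sketch under loud assumptions: the vocabulary has three made-up entries, and the three scores are themselves the "internal numbers" being nudged, whereas a real model computes its scores from billions of parameters. The update rule used is the standard one for this kind of prediction loss: raise the true token's score and lower the others in proportion to how much probability they were given.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nudge(scores, true_index, step_size=0.5):
    """One training step: raise the true token's score, lower the rest."""
    probs = softmax(scores)
    return [
        s - step_size * (p - (1.0 if i == true_index else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]


vocab = ["Paris", "London", "banana"]   # hypothetical three-token vocabulary
scores = [0.0, 0.0, 0.0]                # an untrained model: no preference
true_index = vocab.index("Paris")       # the hidden chunk, read from the text

before = softmax(scores)[true_index]
for _ in range(20):
    scores = nudge(scores, true_index)
after = softmax(scores)[true_index]
print(round(before, 3), "->", round(after, 3))
```

Each pass moves the numbers only slightly, but the probability on the true token keeps climbing. Pre-training is this loop repeated over an enormous amount of text.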

The reason this is so beautiful — and the reason the whole field rests on it — is that the answer key is free. Every sentence ever written is its own worked example: the next word is sitting right there in the text. Nobody has to label anything. You don't pay humans to say "the correct continuation of The capital of France is ___ is Paris." The text already told you. This is why pre-training can run on trillions of tokens — the supervision is baked into the data itself. (The jargon for this is self-supervised learning: the data supervises the model, no human grader required.)
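To see why the answer key is free, here is a minimal sketch that turns one raw sentence into training examples with no human labeling. Two simplifications are assumptions for illustration: whitespace-separated words stand in for tokens, and the context is capped at three words.

```python
def training_pairs(text, context_size=3):
    """Turn raw text into (context, next_token) examples with no labeling."""
    tokens = text.split()  # assumption: whitespace words stand in for tokens
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = training_pairs("The capital of France is Paris")
for context, target in pairs:
    print(context, "->", target)
```

One six-word sentence yields five worked examples, the last of which asks for "Paris" after "of France is". Scale that up to trillions of tokens and the supervision comes along for free.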

Play one step of pre-training.

Guess the next token, then reveal what the model would predict. It isn't memorizing facts — it's learning a distribution over how the world tends to continue.

nudge the knobs so the true token scores higher next time

Precomputed illustrative distributions. You just did one step. Now do it a trillion times, over everything humans have written — that's the whole training run.

Now the part that should genuinely surprise you. This game sounds like it should only teach the model to mimic surface text — to be a fancy parrot. But to get good at predicting the next token, the model is quietly forced to learn the machinery underneath the text:

None of these were taught directly. There is no "facts" stage and no "physics" stage. There is only next-token prediction. All of the model's knowledge, grammar, and rough world-model are side effects of getting very, very good at one autocomplete game. That's the single most important idea in this section, and it's worth sitting with, because it explains both why these models are so eerily capable and why they fail in the specific weird ways they do (the strawberry-counting from Section A is one — the letters were never in the game).

Everything fell out of one task.

predict the next token
the only objective
World facts
…is Paris
Cause & effect
…until it was full
Arithmetic
2 + 2 = …
Code syntax
…) { return …
Grammar
…agrees with the subject
Translation
chien → dog
Basic reasoning
therefore …
Story structure
…and the final couplet rhymed

Nobody programmed any of these in. They all condensed out of relentlessly practicing one prediction.

Why the field landed here. Before this, teaching a machine a skill meant collecting a labeled dataset for that skill — expensive, narrow, and capped by how much labeling you could afford. Next-token prediction broke the cap: the entire internet became free training data, and a single objective produced a general model. The bitter, repeatedly-relearned lesson of the last decade is that this dumb, scalable approach beats clever hand-built ones almost every time. That's why every frontier lab pre-trains the same basic way.

C2  The base model — a brilliant genius with no manners

So pre-training finishes. You now have a base model: billions of parameters (the internal knobs, tuned by all that nudging) holding a compressed, tangled imprint of human writing. It knows a staggering amount. And it is not a chatbot.

This trips up almost everyone, so let's make it concrete. A base model is a pure text-completer. It does exactly one thing: continue whatever text you give it, in the most statistically plausible way, based on the internet it ate. It has no concept that it's supposed to help you. Ask a raw base model:

"What's a good recipe for banana bread?"

…and instead of a recipe, it might continue:

"What's a good way to keep bananas from going brown? What's the best pan to use? These are the questions every home baker asks…"

Why? Because on the open internet, a question is very often followed by more questions — in forum posts, FAQs, listicles. The base model isn't broken. It's doing its job perfectly: predicting the most plausible continuation. It just turns out that "answer the user's question helpfully" is not the most statistically common thing that follows a question on the raw web. The model is, as we put it in Section A, brilliant but feral.

Same brain, different behavior.

Base model · after pre-training
It completes, it doesn't answer
"What's a good recipe for banana bread?"
"What's a good way to keep bananas from going brown? What's the best pan to use? These are the questions every home baker asks…"
Knows everything, helps with nothing.
Section D
post-training
Helpful model · after post-training
Same knowledge, taught to be useful
"What's a good recipe for banana bread?"
"Here's a simple loaf: 3 ripe bananas, 1⅓ cups flour, ⅓ cup melted butter, ¾ cup sugar, 1 egg, 1 tsp baking soda. Mash, mix, bake at 350°F for ~55 min…"
Same brain — now it answers.

Pre-training builds the raw intelligence. It does NOT build the helpfulness. Those are two different jobs.

This is the cleanest way to hold the whole architecture of model-building in your head:

That second job — the finishing school that turns this feral genius into the assistant you actually chat with — is the entirety of Section D. For now, just lock in the division of labor: pre-training makes it smart; post-training makes it helpful, and they are not the same thing. When you hear a lab talk about "the base model" versus "the chat model" or "the instruct model," this is the line they're drawing.

Why this matters for evaluation. A lot of a model's personality and safety lives in post-training, but almost all of its raw ceiling — what it could ever possibly know or reason about — is set here, in pre-training. If a startup is fine-tuning someone else's open base model, they're decorating an intelligence they didn't build, and its ceiling is mostly fixed. That's not necessarily bad — but you should know which job they're actually doing.

C3  The model is what it eats — data quality and the mixture

If pre-training is the model learning the world by reading, then the single biggest lever on what it becomes is what you let it read. The slogan, only half a joke: the model is what it eats. Garbage in, garbage out — at the scale of the entire internet.

There are two knobs here, and they're different:

1. Quality. The raw internet is mostly junk — spam, broken HTML, duplicated boilerplate, SEO sludge, toxic comment threads. Feed that in unfiltered and you get a model that has absorbed the internet's worst habits. So labs spend enormous, unglamorous effort filtering: deduplicating, stripping low-quality pages, scoring text for usefulness, scrubbing the worst content. The headline result the field keeps re-confirming is blunt: better-filtered data produces a better model at the same size and compute. The FineWeb work, for instance, showed that careful curation of web text measurably lifted downstream performance — the data recipe, not just the model, was the win.19

Most of pre-training is deciding what to throw away.

Raw internet — ~everything
spam · dupes · broken pages · forums · gold nuggets, all mixed
↓  dedupe
Deduplicated
strip near-identical copies
↓  quality-score
Quality-scored · toxic removed
drop low-value & harmful text
↓  format
Curated training corpus
clean, formatted, ready to feed
Illustrative: often well over half of crawled text, and in aggressive pipelines the vast majority, is discarded.

Curating the diet is a real, guarded competitive edge — most of the engineering is deciding what's worth feeding it.
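The dedupe and quality-score stages of the funnel can be sketched in a few lines. This is a deliberately crude toy, not any lab's pipeline: the helper names and thresholds are invented, duplicates are caught only when they match exactly after normalizing case and spacing, and "quality" is a two-rule heuristic. Production pipelines typically use near-duplicate detection and learned classifiers instead.

```python
import hashlib


def normalize(text):
    """Collapse case and whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def looks_useful(text, min_words=5, max_repeat_ratio=0.5):
    """Toy quality score: long enough, and not dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio


def curate(documents):
    """Keep the first copy of each document that passes the quality check."""
    seen, kept = set(), []
    for doc in documents:
        fingerprint = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue            # dedupe: drop exact copies after normalizing
        seen.add(fingerprint)
        if looks_useful(doc):   # quality-score: drop low-value text
            kept.append(doc)
    return kept


raw = [
    "Mash three ripe bananas, then fold in the flour and sugar.",
    "mash three ripe bananas,  then fold in the flour and sugar.",
    "buy buy buy buy buy now",
    "Click here",
]
corpus = curate(raw)
print(len(raw), "->", len(corpus))  # 4 -> 1
```

Even in this tiny example, three of four documents are thrown away. Every threshold in `looks_useful` is a judgment call about what counts as good text.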

2. Mixture. Beyond cleaning, labs deliberately blend their data — so much web text, so many books, so much code, so much math, so much from each language. And the blend shapes the abilities. The most striking example is a widely-reported observation across lab tech-reports (rather than a single settled paper): adding more code to the training mix appears to make a model better at reasoning — even on tasks that aren't coding — apparently because code is unusually rich in clean, explicit, step-by-step logical structure. The effect is real enough that labs act on it; the precise mechanism and magnitude are still debated, so hold it as a strong regularity, not a law. Want a model that's good at structured thinking? Feed it more of the most structured text humans produce. The mixture is a dial on cognition, not just on vocabulary.

Same size, different diet, different mind.

Config A · code & math pushed up
Web
Books
Code
Math / STEM
Multilingual
Conversations
→ stronger reasoning & coding
Config B · multilingual pushed up
Web
Books
Code
Math / STEM
Multilingual
Conversations
→ stronger translation, more even cross-language quality

Same architecture, same size — the mixture is one of the most closely-guarded recipes in the field. (Sliders illustrative, not measured.)
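Mechanically, a mixture is just a set of weights that decides how often each source gets drawn during training. The sketch below uses invented weights for a "Config A"-style blend; the numbers are hypothetical and not taken from any real model.

```python
import random

# Hypothetical mixture weights (fractions of training tokens); not measured values.
config_a = {"web": 0.45, "books": 0.10, "code": 0.25, "math": 0.12,
            "multilingual": 0.05, "conversations": 0.03}


def sample_sources(mixture, num_draws, seed=0):
    """Draw which source each training batch comes from, per the mixture."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    draws = rng.choices(names, weights=weights, k=num_draws)
    return {name: draws.count(name) / num_draws for name in names}


shares = sample_sources(config_a, num_draws=10_000)
for name, share in shares.items():
    print(f"{name:<14} target {config_a[name]:.2f}  sampled {share:.2f}")
```

Turning the dial means editing one dictionary. The hard part, and the closely guarded part, is knowing which numbers to put in it.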

Why the field moved here. Early on, "more data" meant "more web text," full stop. The field gradually learned that which data matters as much as how much — that you can buy capability not just with bigger crawls but with smarter curation and a deliberate mixture. This is also why "we have proprietary, high-quality data" is one of the more credible moats a startup can claim: when public high-quality text is finite (hold that thought — C4 and C6), a private corpus of clean, relevant text is genuinely valuable.

What it still can't do: data curation can't conjure knowledge that isn't in the data, and it can't fully remove the biases woven through human writing — it can only shift them. And every filtering choice is itself a judgment call about what's "good" text, which bakes the curators' assumptions into the model. The diet shapes the mind, including its blind spots.

C4  Chinchilla — how to spend a fixed pile of money

Now the one place in this section where a number earns its keep. Everything else here you can hold as intuition; this one result reshaped how every major lab budgets a training run, so it's worth feeling precisely.

Start with the core tension. A pre-training run has a roughly fixed budget: a fixed amount of compute (GPU-hours, which is to say money and time). You get to spend that budget on two things, and they trade off against each other: model size (how many parameters it has) and training data (how many tokens it reads).

Spend more on one and, for a fixed budget, you must spend less on the other. So: bigger model trained on less data, or smaller model trained on more data?

For years the field's instinct — codified in an influential 2020 paper from Kaplan and colleagues20 — leaned hard toward size. Make the model as big as you can; data was treated as the lesser priority. The result was an arms race of ever-bigger parameter counts (175 billion, 280 billion, 530 billion…), and headlines that measured a model by its size alone.

Then in 2022, DeepMind's Chinchilla paper21 ran the experiment properly — training over 400 models across a wide range of sizes and data amounts — and delivered the verdict that the giants had been doing it wrong. Their finding, in plain words:

Most of the big models were badly undertrained. They had too many parameters and had been shown too little data. For the same compute budget, you'd get a smarter model by making it smaller and feeding it far more text. Roughly: every time you double the model's size, you should also double its training data — keep them balanced.

They proved it the only way that counts: they built Chinchilla (70 billion parameters, trained on ~4× more data) and it beat Gopher (280B), GPT-3 (175B), and the 530B Megatron-Turing model — while being smaller and therefore cheaper to run. A model a quarter the size, trained right, won.
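The comparison can be checked with back-of-envelope arithmetic. Two things in the sketch below come from outside this article and are assumptions for illustration: the common rule of thumb that training compute is roughly 6 × parameters × tokens, and the training-token counts reported in the Chinchilla paper (about 1.4 trillion for Chinchilla, about 300 billion for Gopher).

```python
def training_flops(parameters, tokens):
    """Rule-of-thumb estimate: about 6 operations per parameter per token."""
    return 6 * parameters * tokens


models = {
    # name: (parameters, training tokens); token counts as reported by Hoffmann et al.
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name:<11} {training_flops(params, tokens):.2e} FLOPs, "
          f"{tokens / params:.1f} tokens per parameter")
```

Under these assumptions both runs cost roughly 5 to 6 × 10^23 operations, so the budget is about the same. What differs is the split: Gopher saw about one token per parameter, Chinchilla about twenty.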

This is the marquee chart of the section. It deserves to be felt, not read:

Bigger isn't smarter. Balanced is smarter.

For any fixed budget there's a Goldilocks point. Too big is just as wrong as too small — and the whole industry had been overshooting to the right.

Compute budget: medium

More budget slides the whole sweet spot down-and-right — you can afford a bigger model and more data. The optimum isn't a fixed size; it depends on what you're spending.

After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations, no hard axis numbers.

Why it changed everything. Chinchilla didn't just tweak a hyperparameter — it told every lab they'd been wasting compute by overshooting on size. Almost overnight, the bragging metric shifted: "how many parameters?" started to matter less than "how many tokens did you train on?" Modern models are, by Chinchilla's logic, often deliberately smaller than their predecessors but trained on vastly more data (many trillions of tokens) — which also makes them cheaper to run, a double win. When you internalize this, the Section A warning lands hard: a giant parameter count is a vanity metric. The right question is always "trained on how much, and on what?"

And now the limitation that sets up the rest of the section — the one Chinchilla can't answer. Its rule says "for more compute, use a bigger model and more data." But it quietly assumes you can always get more data. In the real world, high-quality text is finite. Chinchilla tells you the ideal recipe; it doesn't tell you what to do when you run out of the main ingredient. That ingredient shortage is the data wall — and it's the live debate of this era.

C5  Continued pre-training — you don't always start over

Before we hit the wall, one practical and often-misunderstood move. A finished base model isn't necessarily done being pre-trained. You can take that completed model and keep running the same next-token game on new text — fresher data, or a specialized corpus — to extend or specialize it without paying for a full training run from scratch. This is continued pre-training (sometimes "continual" or "domain-adaptive" pre-training).

Two common uses make it concrete:

You rarely start from zero.

Pre-training (from scratch)
months · tens of millions of $$$ · the whole internet
Base model v1
Continued pre-training
same next-token game, new/specialized data · days-to-weeks, a fraction of the cost
Specialized / freshened model

A finished base model is a launch point — keep feeding it the same game (C1) on new data to extend or specialize it.

Why this matters for evaluation. Continued pre-training is one of the main ways smaller players build something genuinely valuable on top of an open base model — a real domain specialist for a fraction of frontier cost. When a startup says "we trained our own medical model," this (plus Section D's fine-tuning) is often what they actually did, and it can be a legitimately strong product. The honest question is whether they added domain depth (credible) or just slapped a system prompt on someone else's model (much weaker).

What it still can't do: continued pre-training can add and freshen knowledge, but it can't fully overwrite what's baked in, and pushing too hard on a narrow corpus risks catastrophic forgetting — the model getting better at the specialty while quietly getting worse at general skills it used to have. It's a launch point, not a magic wand.

C6  The limits — where pre-training runs out of road

Pre-training is the engine of the whole field, but it has hard edges. You need these to evaluate any "we'll just scale up" claim, because "just scale up" is exactly what's getting harder.

The data wall — high-quality text is finite. This is the big one, and it's a genuinely open debate, not settled doom. Chinchilla says "more compute → use more data." But there's only so much good human-written text in existence. The most-cited estimate, from Epoch AI's Villalobos and colleagues, puts the usable stock of high-quality public text at very roughly a few hundred trillion tokens, and projects that frontier training could exhaust the high-quality public supply sometime in the second half of the 2020s if current trends hold.22 Past that point, you can't follow Chinchilla's advice anymore — there isn't enough fresh, clean data to balance ever-bigger models. This is the data wall, and it's reshaping strategy across the field. (See the State-of-play box for where that debate actually stands in mid-2026.)

Appetite meets supply.

We're not out of data — we're running low on cheap, high-quality, PUBLIC text. The race is now about what to do when the easy ingredient runs short.

Escape hatch 1
Synthetic data
models generating their own training text
Escape hatch 2
New modalities
video, audio, images as fresh signal
Escape hatch 3
Smarter use of existing data
better curation, more passes

Illustrative shapes after Villalobos et al. (Epoch AI) 2022/2024. The shape is the lesson, not the numbers.

Baked-in biases and a frozen knowledge cutoff. Because all knowledge is a side effect of the training data (C1), the model inherits the data's biases — its skews, gaps, and prejudices — and its knowledge freezes at the knowledge cutoff (the date the data ends). A pre-trained model literally cannot know about events after its cutoff; it will confidently reason as if the world stopped on that date. (This is one big reason real systems bolt on tools like web search — to paper over a limitation that's structural to pre-training.)

Raw scale has diminishing returns. Early on, every order-of-magnitude more compute bought dramatic jumps. Increasingly, it buys less per dollar — the curve is bending. Bigger-and-bigger alone is no longer the obvious path to a smarter model, which is a large part of why the field's energy has shifted toward post-training and reasoning (Sections D and E) rather than just inflating pre-training. Scale still matters; it's just no longer a free lunch.

And the one we opened with: a base model is unsafe and unhelpful on its own. Pre-training, by itself, never produces a usable assistant — only a feral text-completer (C2). It will happily continue toxic text, make things up fluently, and ignore your actual request. Everything that makes a model safe, steerable, and helpful comes later. Pre-training builds the raw mind; it does not, and cannot, build the manners.

That hand-off — from a brilliant, dangerous, finished base model to the polite assistant you actually talk to — is precisely where Section D · Post-training picks up.

STATE OF PLAY — June 2026
· The "data wall" is real but contested. High-quality PUBLIC text is the scarce
  resource; total data (private, synthetic, multimodal) is not. The frontier
  debate is no longer "are we running out?" but "do the escape hatches work?"
· Escape hatch #1 — SYNTHETIC DATA — is now mainstream: frontier models train
  partly on text generated/curated by other models. Bull case: it extends the
  supply and lets you target weak spots. Bear case: "model collapse" — training
  on your own exhaust can quietly degrade quality if done carelessly.
· Escape hatch #2 — NEW MODALITIES — video/audio/image are increasingly treated
  as fresh pre-training signal as text tightens.
· Compute-rich players (the largest US labs, plus well-funded Chinese labs like
  DeepSeek/Qwen) can still out-scale; but the Chinchilla-era "just add data and
  parameters" reflex has clearly given way to data-quality, synthetic-data, and
  post-training/RL competition as the main levers.
· Continued pre-training + domain corpora is the standard way smaller players
  build credible specialists on top of open base models.
Specific labs, token-stock estimates, and exhaustion dates will age fast; the
mechanisms (next-token prediction, the size↔data tradeoff, the diet effect,
the finite-data tension) will not.

Section E  Reinforcement Learning, in plain English

Everything so far — pre-training, fine-tuning — was the model learning from examples that already existed. Someone wrote the text; the model copied the pattern. Reinforcement learning (RL) is fundamentally different, and the difference is the whole point: there are no examples to copy. The model has to learn from the consequences of its own actions.

That's a big shift, so let's not start with the model at all. Let's start with a dog.

E1  The whole idea, in one analogy: training a dog

You want to teach a dog to sit. You can't explain it. You can't show it a textbook. All you can do is: wait, watch what the dog does, and reward the behavior you like. Dog flops down? No treat. Dog sits? Treat. Over many tries, the dog does more of what gets treats and less of what doesn't. It never gets told the rule — it discovers the rule by chasing the reward.

That is reinforcement learning, entire. Now here's the same picture with the five pieces of jargon labeled — because once you've seen them on the dog, they'll never scare you again:

Same five pieces, on a dog you already understand.

Every term below is the whole field of RL. None of them are new — you just learned them as a kid, teaching a dog to sit.

Policy (the plan): The dog's current strategy for getting treats.

Rollout (one try): One attempt to sit, start to finish.

Reward (R 9): Treat or no treat. Just a number, not yet good or bad.

Advantage (reinforce or suppress): Better or worse than its average try. This is the part that actually teaches.

Explore-Exploit (the choice): Try something new vs. repeat what already worked.

This panel is the key. The same five colors and words come back later as the actual training loop — only then, the "dog" is a language model.

The Rosetta Stone for reinforcement learning. Five jargon words, one familiar scene.

Let's take the five pieces one at a time. We'll keep the dog around, and bring in a video game when it helps.

E2  Policy — the player's current strategy

The policy is just the model's current strategy for what to do next. For the dog, it's "given what I'm seeing and hearing, what should my body do?" For a language model, the policy is literally the model itself: given the conversation so far, what's its strategy for choosing the next token?

The entire goal of RL is to improve the policy — to make the strategy better over time. At the start it's bad (the dog flops, the model rambles). Each round of training nudges the strategy toward choices that earn more reward. When a lab says "we did RL on the model," they mean: we ran this loop to upgrade the model's strategy.

Think of it as a video game too: your policy is your current playing style — how good you are at the game right now. A beginner's policy mashes buttons; an expert's policy is refined. RL is the practice that turns one into the other.

E3  Reward — the treat, and the trouble with treats

The reward is the signal that tells the model how good an outcome was. Treat for the dog. Points for the game. For an LLM, the reward might come from that reward model we met in post-training (it scores "how much would a human like this answer?"), or — in the powerful newer setups — from something far more objective: did the code pass the tests? Did the math problem reach the correct final answer?

That last point is quietly enormous, and it's worth flagging now because it explains a lot of the 2025–2026 frontier:

Why math and code are the RL goldmine. In most of life, "was that a good answer?" is fuzzy and needs a human to judge. But for math and code there's an automatic, unarguable reward: the answer is right or wrong, the tests pass or fail. That means you can run RL at massive scale with no humans in the loop, generating millions of attempts and rewarding the ones that work. This is why the models that suddenly got dramatically better at math and coding got there through RL — those domains hand you a perfect treat-dispenser for free.
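A "perfect treat-dispenser" for code can be as simple as running the tests. The sketch below is a toy under stated assumptions: the task, the three candidate attempts, and the function name `reward_from_tests` are all invented for illustration, and real systems run untrusted code inside a sandbox rather than calling it directly.

```python
def reward_from_tests(candidate, test_cases):
    """Fraction of test cases passed; a crash counts as a failure."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)


# Hypothetical task: "write a function that multiplies two numbers".
tests = [((17, 24), 408), ((3, 0), 0), ((-2, 5), -10)]


def attempt_correct(a, b):
    return a * b


def attempt_wrong(a, b):
    return a + b  # adds instead of multiplying


def attempt_crash(a, b):
    return a / (b - b)  # always divides by zero


print(reward_from_tests(attempt_correct, tests))  # 1.0
print(reward_from_tests(attempt_wrong, tests))    # 0.0
print(reward_from_tests(attempt_crash, tests))    # 0.0
```

No human reads any of the attempts. The tests do the judging, which is what lets this kind of reward run millions of times.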

But rewards are also where RL gets dangerous, and this is the limitation you must understand to evaluate any "we used RL" claim. Whatever you reward, you get — including the loopholes. This is reward hacking (a flavor of Goodhart's Law: when a measure becomes a target, it stops being a good measure). The dog version: if you accidentally treat the dog every time it barks while sitting, you'll train a dog that sits and barks its head off, because barking became part of "what gets treats." The model version: if your reward model slightly prefers longer, more confident-sounding answers, RL will gleefully produce a model that's longer-winded and more confidently wrong — it found the loophole. RL optimizes exactly what you measure, not what you meant.
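The loophole is easy to reproduce in miniature. In this sketch everything is invented for illustration: two candidate answers, an intended reward that checks correctness, and a flawed reward that adds a small bonus per word. Picking the top-scoring answer stands in for what optimization pressure does over many rounds.

```python
answers = [
    {"text": "408.", "correct": True},
    {"text": "It is definitely, certainly, without any doubt 388, "
             "as any careful reader can plainly see.", "correct": False},
]


def intended_reward(answer):
    """What we meant to reward: being right."""
    return 1.0 if answer["correct"] else 0.0


def flawed_reward(answer):
    """What we accidentally measured: a grader that also likes length."""
    return intended_reward(answer) + 0.1 * len(answer["text"].split())


best_by_intent = max(answers, key=intended_reward)
best_by_flaw = max(answers, key=flawed_reward)
print("intended reward picks:", best_by_intent["text"])
print("flawed reward picks:  ", best_by_flaw["text"])
```

The flawed grader prefers the long, confident, wrong answer. Nothing in the optimization is broken; the measurement is.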

This is why frontier labs spend so much effort designing rewards that can't be gamed, and why "we did RL and the benchmark went up" should make you ask: did the model get smarter, or did it just learn to please your specific reward? (We'll return to that as a litmus test.)

E4  Rollout — one full attempt at the level

A rollout is one complete attempt, start to finish. The dog's single try at sitting. One full playthrough of a game level. For a language model, a rollout is the model generating a whole answer to a prompt — the entire response, start to "done."

Why give this its own word? Because RL learns by comparing many rollouts. You don't learn much from one attempt. You let the model take a hard problem and try it, say, a hundred different ways (this is where temperature and randomness earn their keep — they make the attempts vary). Some rollouts nail it; some flop. The reward sorts them. And from that spread of "this attempt good, that one bad," the model figures out what to do more of. Rollouts are the raw experience RL learns from — the model's own attempts are its only textbook.

One problem, many attempts.

RL doesn't need a textbook answer — it just needs to know which of its OWN attempts worked, then do more of those.

PROMPT
Solve: 17 × 24 = ?
One prompt → many attempts.
17 × 24 = 408 R 9
reinforce
(17×20)+(17×4) = 408 R 9
reinforce
408 (rounded check) R 8
reinforce
17 × 24 = 388 R 2
suppress
17 + 24 = 41 R 1
suppress
RL learns by comparing many rollouts — the spread is the lesson.

Illustrative rollouts for a math problem. Green checkmarks indicate correct answers, red X's indicate wrong answers.
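A rollout fan can be sketched in a few lines. This toy assumes the "model" is just three candidate answers with made-up preference scores (`logits`), and that temperature divides those scores before sampling. `sample_rollouts` is a hypothetical helper, not a library call.

```python
import math
import random


def sample_rollouts(logits, temperature, n, seed=0):
    """Draw n answers from a softmax over the logits, reshaped by temperature."""
    rng = random.Random(seed)
    answers = list(logits)
    scaled = [logits[a] / temperature for a in answers]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]  # subtract max for numerical stability
    return rng.choices(answers, weights=weights, k=n)


# Hypothetical preferences of an untrained model for "17 x 24 = ?"
logits = {"408": 1.0, "388": 1.2, "41": 0.2}

rollouts = sample_rollouts(logits, temperature=1.0, n=100)
rewards = [1.0 if answer == "408" else 0.0 for answer in rollouts]
print("distinct answers tried:", sorted(set(rollouts)))
print("fraction correct:", sum(rewards) / len(rewards))

# With temperature near zero every rollout is the model's current favorite,
# so there is no spread to learn from.
print(set(sample_rollouts(logits, temperature=0.001, n=100)))
```

Note that the model's current favorite here is the wrong answer. Without randomness it would never stumble on the right one, and the reward would have nothing to sort.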

E5  Advantage — "was that better than my usual?"

Here's the subtle one, and it's the key that makes RL actually work. Suppose the dog sits and gets a treat. Good — but how good? If the dog sits every time and always gets a treat, then this particular sit was nothing special; it's just average. But if the dog usually flops and this time it sat — that sit was a big positive surprise, and that's the moment worth reinforcing hard.

Advantage is exactly this: how much better (or worse) was this attempt compared to what I'd normally expect? Not the raw reward — the surprise in the reward. A rollout that scored above the model's average gets pushed harder ("do more of this!"); one that scored below gets pushed away ("less of that"); one that's exactly average barely moves anything.

Why not just use the raw reward? Because raw scores are noisy and uninformative on their own. A "7 out of 10" means nothing until you know whether 7 is great (you usually get 3s) or disappointing (you usually get 9s). Advantage is the baseline-subtracted signal — it strips out "how hard is this problem in general" and isolates "did this attempt beat my own expectation." That's the clean learning signal. It's why a beginner gamer improves fastest: almost everything they try is "better than my terrible average," so the advantage signal is strong and every small win teaches a lot.

Advantage = reward − baseline

Reward says "this scored 9." Advantage says "this beat your usual 6 — do more of it." The second one is what actually drives learning.

FIXED PROMPT & ROLLOUTS
Solve: What is the capital of France?
Model's current baseline (average score): 6.0
Drag the baseline to see how advantages flip. When your average goes up, the same rollout becomes less impressive.

Interactive advantage calculation. The relative comparison is what drives learning, not the absolute reward scores.
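The subtraction is small enough to check by hand. This sketch reuses the five illustrative scores from the rollout figure in E4 and the 6.0 baseline shown above. The helper name `advantages` is an assumption for illustration.

```python
def advantages(rewards, baseline):
    """Advantage = reward minus the model's usual (baseline) score."""
    return [reward - baseline for reward in rewards]


rewards = [9, 9, 8, 2, 1]

print(advantages(rewards, baseline=6.0))  # [3.0, 3.0, 2.0, -4.0, -5.0]
print(advantages(rewards, baseline=9.0))  # [0.0, 0.0, -1.0, -7.0, -8.0]
```

Same rollouts, same rewards. Once the usual score rises to 9, the attempts that used to look great carry no learning signal at all.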

Under the hood, lightly. The famous RL algorithms you'll hear named — PPO (the workhorse from RLHF) and the leaner GRPO that powered DeepSeek's math breakthrough — are, at heart, careful machinery for computing this advantage and nudging the policy by it without lurching too far in one step.9 10 That "don't lurch too far" guardrail matters: push the model too hard toward the reward in one update and it can break — forgetting its language skills while chasing points (a failure labs informally call drift, related to catastrophic forgetting) — which labs hold back with a leash (a KL penalty) tying the model to its sensible starting point. You don't need the equations. You need the shape: try many times, see which tries beat your average, lean that way — but gently.

E6  Exploration vs. exploitation — the gambler's dilemma

The last piece is the tension that sits underneath all of RL, and it's deeply human. Imagine your favorite restaurant. Every night you face a choice: order the dish you know is great (exploit what works), or try something new on the menu that might be even better — or might be a disappointment (explore). Order the usual forever and you'll never discover the better dish. Gamble every night and you'll eat a lot of bad meals. The art is the balance.

That's exploration vs. exploitation, and every RL system lives or dies by it:

Every step, a choice.

Learn nothing new, or risk everything — RL is the constant art of tuning this dial. Early on, explore boldly. As you get good, exploit what works.

EXPLORE-EXPLOIT
EXPLOIT
Repeat what worked
Take the move that already scored — the safe, known win.
EXPLORE
Try something new
Risk an untested move — it might be worse, or it might be better.
Exploit too hard and it never improves; explore too much and it forgets what works. Balance is the whole game.

The fundamental tension in reinforcement learning — between safety and discovery.
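One classic way to set the dial is "epsilon-greedy": explore at random with a small probability, otherwise take the best option known so far. The sketch below is the restaurant story under made-up dish scores. It is a textbook bandit toy, not how any LLM is actually trained.

```python
import random


def average_meal(epsilon, nights=2000, seed=0):
    """Epsilon-greedy diner: explore with probability epsilon, else order the best-known dish."""
    rng = random.Random(seed)
    quality = {"usual": 0.6, "new dish": 0.8, "mystery special": 0.3}  # hypothetical scores
    # The diner starts out knowing only the usual; untried dishes are assumed to score 0.
    known = {"usual": 0.6, "new dish": 0.0, "mystery special": 0.0}
    total = 0.0
    for _ in range(nights):
        if rng.random() < epsilon:
            dish = rng.choice(list(quality))  # explore: order anything
        else:
            dish = max(known, key=known.get)  # exploit: best dish known so far
        known[dish] = quality[dish]           # tasting reveals the dish's score
        total += quality[dish]
    return total / nights


for epsilon in [0.0, 0.1, 1.0]:
    print(f"epsilon={epsilon}: average meal quality {average_meal(epsilon):.3f}")
```

Never exploring leaves the diner stuck on the usual forever. Always exploring means eating the mystery special a third of the time. The small-but-nonzero setting finds the better dish and then mostly sticks with it.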

This is also where you can feel why RL on language is so much harder than RL in a game. In chess, every move is legal-or-not and the board tells you the truth. In language, the space of possible "moves" (sentences) is effectively infinite, the reward is often a fuzzy human judgment, and a model can explore its way straight into eloquent nonsense that fools the reward model. RL gave us the leap in reasoning models — but it's a leap walked on a knife's edge between "discovered something genuinely new" and "found a clever way to cheat the score."

E7  Putting it together — and how to use it as a bullshit detector

Step back and you can now read RL as one clean loop, in five plain words: try, score, compare, lean, repeat. The model (policy) takes many full attempts (rollouts), each earns a reward, advantage measures which attempts beat the model's own average, the strategy leans toward those — gently, on a leash to prevent drift — while balancing exploration against exploitation. Run that loop at scale, with a reward you can trust, and you get the dramatic reasoning gains of the modern era.

Try, score, compare, lean, repeat.

Every modern reasoning model is this loop, run a staggering number of times.

1
POLICY tries
Current strategy generates one ROLLOUT — an attempt.
2
REWARD scores it
A number for the attempt — R 9
3
ADVANTAGE = reward − baseline
Above average → reinforce; below → suppress.
4
Update POLICY
Nudge the weights toward what beat the baseline. Repeat.
the same machine, again
Five words ran the dog; the same five words run the model — unchanged.

The complete reinforcement learning loop. Policy → Rollouts → Rewards → Advantage → Policy update → Repeat.
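The whole loop fits in a toy. This is a minimal sketch, assuming a "policy" that is nothing more than three preference scores over three candidate answers, a right-or-wrong reward, and a plain policy-gradient update with the batch average as baseline. That is a simpler stand-in for the real algorithms: it has no clip and no leash, and all numbers are made up.

```python
import math
import random


def softmax(logits):
    """Turn preference scores into probabilities."""
    top = max(logits.values())
    exps = {a: math.exp(v - top) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def train(logits, correct, steps=200, group_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = dict(logits)
    for _ in range(steps):
        probs = softmax(logits)
        answers = list(probs)
        # TRY: the policy produces a batch of rollouts.
        rollouts = rng.choices(answers, weights=[probs[a] for a in answers], k=group_size)
        # SCORE: each rollout earns a reward.
        rewards = [1.0 if a == correct else 0.0 for a in rollouts]
        # COMPARE: advantage = reward minus the batch average.
        baseline = sum(rewards) / len(rewards)
        # LEAN: nudge each score by advantage times the gradient of log-probability.
        for answer, reward in zip(rollouts, rewards):
            advantage = reward - baseline
            for a in answers:
                indicator = 1.0 if a == answer else 0.0
                logits[a] += lr * advantage * (indicator - probs[a])
        # REPEAT.
    return softmax(logits)


start = {"408": 1.0, "388": 1.2, "41": 0.2}  # the wrong answer starts as the favorite
print("before:", round(softmax(start)["408"], 3))
print("after: ", round(train(start, correct="408")["408"], 3))
```

Nobody ever tells the policy that 408 is right. It only learns which of its own attempts beat its own average.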

And here's the payoff — the reason a layperson should care about any of this. When someone tells you "our model got better because of reinforcement learning," you now own the questions that separate substance from spin:

If they have crisp, technical answers, you're likely looking at real work. If they wave their hands and say "we did RL," you now know enough to keep your wallet closed.

E8  PPO, from scratch — how to step toward the reward without falling off the cliff

We just said the whole loop in five words: try, score, compare, lean, repeat. But there's a word doing enormous quiet work in there, and it's lean. Once advantage has told you which attempts beat your average, you have to actually change the policy — nudge the strategy toward the good attempts. And it turns out how hard you nudge is the entire ballgame. Push gently and the model barely learns. Push too hard and the model breaks. PPO is the field's answer to "how hard do I push?" — and once you feel why that question is dangerous, the famous algorithm stops being an acronym and becomes obvious.

Start with the danger, because it's not intuitive. Go back to the restaurant. Suppose one night you try a new dish and it's spectacular — a huge positive advantage, way above your usual meal. The naive move is to overreact: "this is the best thing I've ever eaten, I will now order it every single night and never order anything else." You've just thrown away your whole balanced sense of what's good based on one great data point. Maybe that dish was great that night because the chef was in a mood. Maybe you'd hate it the third time. By lurching all the way over, you didn't just adopt a good idea — you destroyed the rest of your taste in the process.

A model does exactly this if you let it. A single batch of attempts says "longer answers scored higher" — and if you shove the policy hard in that direction in one update, the model can swing so far that it forgets how to write a short answer at all, or how to write coherently, while chasing that one signal. The update meant to make it slightly better makes it dramatically, brittlely worse. This is the cliff. The signal you're learning from is noisy and partial, but the update is permanent — so a big confident step on a small noisy signal is how you wreck a working model.

So the real engineering problem of RL on a language model isn't "find the good attempts." Advantage already did that. The problem is: take a step toward them that is big enough to learn from but small enough that you can't fall off the cliff. You want to nudge, never yank.

That is PPO — Proximal Policy Optimization, the workhorse algorithm behind the original RLHF that made ChatGPT-style models follow instructions.6 9 The name is just the idea spelled out: proximal means "stay near where you started." PPO's one trick is a clip — a hard cap on how far a single update is allowed to move the policy. If an attempt had a big positive advantage, PPO says "yes, lean toward it… up to a limit, and not one inch past." Beyond that limit, extra eagerness earns you nothing — the update is clipped flat, so there's no incentive to lurch.9 It's the seatbelt that lets you press the accelerator.

The cooking analogy makes it concrete. You taste a sauce; it needs salt. The reward (advantage) says "saltier is better." A reckless cook dumps in the whole shaker — and now it's inedible, overshooting the very signal that was trying to help. A good cook adds a pinch, tastes again, adds another pinch. PPO is the pinch. It refuses to add more than a pinch per taste, no matter how strongly that one taste screamed "MORE SALT." The reward might be right about the direction and badly wrong about the dose — so you trust the direction and distrust the dose. That gap, between which way and how far, is the whole reason PPO exists.

Same direction, very different dose.

Both steps head the exact same way — toward the reward. The only difference is how far. PPO caps the step at the proximal ring and stays safe; the reckless one keeps going until it's out past where the model holds together. Lean, don't lurch.

Diagram labels: home; safe zone = the proximal ring (the clip radius); danger zone = where the model breaks. Both steps point the same direction, toward REWARD: the PPO step takes a capped dose, the reckless step a huge dose.
PPO — same direction, capped dose. The step stops at the proximal ring (the clip). Close enough to home that one noisy signal can't wreck the model.
Reckless — same direction, huge dose. One giant step on a noisy signal carries far past the ring into the danger zone. The update meant to help breaks the model.

Schematic of PPO's clipped step. The "clip" caps how far one update may move the policy, no matter how strong the advantage.
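For readers who do want one peek at the machinery, the clip can be sketched for a single action. `ratio` is how much more likely the new policy makes the action than the old one did. The formula follows the clipped objective in the PPO paper.9 The function name and the default `clip_eps=0.2` (a commonly used value) are assumptions for illustration.

```python
def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single action.

    ratio     = new_policy_prob / old_policy_prob for the action taken
    advantage = how much better than baseline the action scored
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A strongly positive advantage: watch the payoff as the policy leans further.
for ratio in [1.0, 1.1, 1.2, 1.5, 3.0]:
    print(ratio, round(clipped_objective(ratio, advantage=2.0), 3))
```

The payoff grows until the ratio reaches 1.2, then goes flat at 2.4. Leaning further earns nothing, so the optimizer has no reason to lurch.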

Now — why did a chunk of the field walk away from PPO, and what is this GRPO you keep hearing about? This is the part worth understanding, because it's the lever that let small open labs reach the frontier on a budget, and it shows up directly in how DeepSeek and others priced their breakthroughs.

PPO has a hidden cost. To know whether an attempt "beat the average," PPO as typically used for LLMs trains a second whole model alongside the one you care about: a "value model" (also called a critic) whose only job is to predict the expected reward, so you have a baseline to subtract. Two models to train, roughly twice the memory, twice the bookkeeping. Expensive. And on math and code (where, remember, you can just try the same problem many times) there's a cheaper baseline sitting in plain sight.

That cheaper baseline is the heart of GRPO — Group Relative Policy Optimization, the method introduced in DeepSeek's DeepSeekMath work.10 The insight is almost embarrassingly simple, and it's pure E5: you already have a spread of attempts at the same problem (your rollout fan from E4). So why train a separate model to guess the average — just use the actual average of the group of sibling attempts as the baseline. Did this attempt beat the other attempts at this same problem? That's your advantage. No second model. No critic. You judge each try against its own siblings.10

The restaurant version: instead of hiring a food critic to tell you what a dish "should" score (PPO's value model), six friends each order, and you simply compare each plate to the table's average that night. Cheaper, no critic on payroll, and for problems where you can sample many attempts cheaply, just as informative.

Same question — "did this beat my average?"

PPO trains a second model to guess the average. GRPO just uses the average of the attempts it already has. That cost-cut is a big reason cheap, open math-RL scaled the way it did.

PPO: two models, more compute
ROLLOUT R 9
ROLLOUT R 7
ROLLOUT R 2
value model / critic → predicts the baseline to subtract
GRPO: one model, cheaper
ROLLOUT R 9
ROLLOUT R 7
ROLLOUT R 2
baseline = the group's own average → each ROLLOUT measured against that line · above = ADVANTAGE

PPO hires a critic to guess the average; GRPO reuses the average of the rollout fan it already had. Same safety, no second model.

Schematic comparison. Both clip the step (E8) and keep a KL leash (E9); they differ mainly in where the advantage baseline comes from.
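The group baseline is short enough to write out. This sketch scores the three rollouts from the figure against their own average. The optional division by the group's spread follows the normalization described in the DeepSeekMath paper.10 The function name and the small `eps` guard are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Judge each rollout against its siblings: subtract the group mean,
    and optionally divide by the group's standard deviation."""
    baseline = mean(rewards)
    centered = [reward - baseline for reward in rewards]
    if not normalize:
        return centered
    spread = pstdev(rewards)
    return [c / (spread + eps) for c in centered]  # eps avoids dividing by zero


group = [9, 7, 2]  # rewards for three attempts at the same problem
print(group_relative_advantages(group, normalize=False))
print([round(a, 2) for a in group_relative_advantages(group)])

# If every sibling scores the same, nobody beat the group: no learning signal.
print(group_relative_advantages([5, 5, 5]))
```

No critic is trained anywhere in that function. The only inputs are scores the rollout fan already produced.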

So the field's arc here is clean: PPO gave us a safe way to step toward a reward without breaking the model — that safety is what made RLHF practical at all.6 9 GRPO kept the same safety (it still clips the step, still nudges-not-yanks) but threw out the expensive critic, swapping it for the group baseline you already had lying around — which is exactly why verifiable-reward RL on math and code got cheap enough to run at enormous scale, and why, by 2025–2026, much of the reasoning-model wave was trained with recipes of this kind.10 What neither can do: invent a good reward, or save you if your reward is gameable. The step is now safe; the target still has to be honest. (Straight back to E3 — the machinery for leaning carefully says nothing about whether you're leaning toward the right thing.)

STATE OF PLAY — June 2026
· GRPO and its variants (the leaner, critic-free recipes) are now the default
  for large-scale RL on verifiable rewards; PPO remains common where a learned
  value model still earns its keep (e.g. some RLHF-from-human-preference work).
· "Did you use a critic or a group baseline?" is a fair, cheap signal of how a
  lab's RL costs actually scale. The specific recipe names will churn; the
  nudge-don't-yank principle underneath them will not.

E9  The KL leash — keeping the model from forgetting how to be a model

PPO's clip stops you from taking one catastrophically big step. But there's a slower, sneakier failure that a step-size cap alone won't catch — and it's the one that should be on your diligence checklist. It's not about any single update being too large. It's about a thousand small, safe-looking updates that all quietly drag the model in the same direction, until it has wandered somewhere terrible.

Here's the failure in its purest form, and it's the dark twin of everything good about RL. Reward pulls the model toward whatever scores well. But "scores well on the reward" and "is still a sensible language model" are not the same thing — and over many rounds, the model can chase the first while losing the second. It learns to produce text that the reward loves and a human finds increasingly unhinged: stilted, repetitive, exploiting some tic of the scorer, eventually collapsing into fluent-looking gibberish that racks up points. It got better at the reward by getting worse at language. The model is, in a real sense, forgetting how to be a model in order to win the game. It's a cousin of what the field has long called catastrophic forgetting — a network learning a new task abruptly erasing what it knew before14 — but the RL version has its own name: drift. Same family (improvement on one front quietly costing you another), distinct mechanism (here it's reward-chasing pulling the model away from coherent language, not a second task overwriting a first).

This is the same monster from E3 — reward hacking — but seen as motion over time rather than a single loophole. The model isn't jumping off a cliff (PPO handles that). It's wandering away from home, one reasonable-looking step at a time, until it's lost.

The fix is exactly what you'd do with anything prone to wandering off: put it on a leash. Before RL starts, you have a perfectly sensible model — the one that came out of pre-training and fine-tuning, that writes fluent, coherent language. You tie the model-being-trained back to that original, sensible version with a tether, and the tether pulls back whenever the new model drifts too far from how the original would have spoken. The model is free to roam and improve at the task — but it can't wander off the property.

That leash has a name you'll hear in every serious RL conversation: the KL penalty. (KL is just a math measure of "how far apart are these two ways of speaking?" — you do not need the formula; you need the picture of a leash.) Every update now answers to two masters: get more reward, and don't drift too far from the sensible starting model.6 10 The reward says "go!" The leash says "…but not past here." What survives is improvement that stays recognizably a good language model.

The reward pulls outward; the KL leash ties it to home.

Long enough to learn, short enough not to forget how to talk. Same machine as E8 — now guarding against drift over time instead of one big step.

Diagram labels: home = the original sensible model; inside the leash limit ("…but not past here") = still a sane language model; outside it = drift, reward-hacked gibberish, which is where the model-in-training ends up if the leash is cut or too loose. REWARD pulls the model outward; the KL leash pulls it home.
Reward pulls outward. Every update tugs the model toward whatever scores well — even if “scores well” drifts away from coherent language.
The KL leash pulls home. It tethers the model to the sensible one it started as, so it can roam and improve — but not wander off the property.

The reward pulls the model outward; the KL leash ties it to the sensible model it started as. Echoes E8's lean, don't lurch — same machine, now over time.

Schematic of the KL penalty. "KL" is a measure of how far the trained model's way of speaking has drifted from its starting point.
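The "two masters" idea can be sketched with three possible outputs and made-up numbers. The third output stands in for reward-hacked gibberish: the scorer loves it, and the sensible model rarely says it. `beta` is a hypothetical name for the leash strength, and `leashed_objective` is an illustrative helper, not a library function.

```python
import math


def kl_divergence(policy, reference):
    """How far the policy's way of speaking has moved from the reference's."""
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def leashed_objective(policy, reference, rewards, beta):
    """Expected reward minus a penalty for drifting away from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


reference = [0.5, 0.3, 0.2]    # the sensible starting model
drifted = [0.01, 0.01, 0.98]   # a policy that has gone all-in on output three
rewards = [1.0, 0.0, 3.0]      # the scorer over-rewards output three

for beta in [0.0, 5.0]:
    home = leashed_objective(reference, reference, rewards, beta)
    away = leashed_objective(drifted, reference, rewards, beta)
    print(f"beta={beta}: stay home={home:.2f}, drift={away:.2f}")
```

With no leash (`beta=0.0`) drifting scores higher, so that is where training goes. With a firm leash the drifted policy's penalty outweighs its extra reward and staying near home wins.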

And here is the craft, the thing that separates a team that knows what it's doing from one that doesn't: the leash comes with no free lunch. Pull it too tight and the model is chained to the stake: it can't move, so it can't learn, and your expensive RL run barely improves anything. Leave it too loose and you're back to drift: the model slips the leash and wanders off into reward-hacked nonsense. The right tension is not a number you can look up; it depends on the task, the reward, how trustworthy that reward is, and how long you train. It is tuned by people who've felt it go wrong both ways. The leash length is a dial, and knowing where to set it is exactly the kind of hard-won judgment you're paying a frontier team for.

Drift is a tradeoff you can feel, not a setting you can copy.

Too loose, it forgets how to talk. Too tight, it can't learn anything. There's no lookup value for the sweet spot — finding it is the craft.

Interactive slider: leash tension (KL strength), running from loose to tight, with a marked sweet spot in the balanced middle and live readouts for task reward and coherence.

Illustrative tradeoff. Reward and coherence are precomputed functions of the KL strength; the sweet spot is deliberately narrow.
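The dial can be turned in a toy setting. For a fixed set of outputs, the policy that best balances reward against a KL penalty has a known closed form: reweight the starting model's probabilities by exp(reward / beta). The sketch below uses that formula with made-up numbers, and treats KL distance from home as a rough stand-in for lost coherence, which is an assumption of this example.

```python
import math


def kl(policy, reference):
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def best_leashed_policy(reference, rewards, beta):
    """Closed-form maximizer of E[reward] - beta * KL(policy || reference)."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


reference = [0.5, 0.3, 0.2]  # the sensible starting model
rewards = [1.0, 0.0, 3.0]    # hypothetical scores; output three is over-rewarded

for beta in [0.1, 1.0, 10.0]:  # loose leash -> tight leash
    policy = best_leashed_policy(reference, rewards, beta)
    gain = sum(p * r for p, r in zip(policy, rewards))
    print(f"beta={beta:>4}: expected reward={gain:.2f}, drift (KL)={kl(policy, reference):.2f}")
```

As `beta` rises, expected reward and drift fall together. A loose leash buys reward by abandoning home, and a tight one keeps the model home by giving up most of the gain. Real training has no closed form to consult, which is why the setting has to be tuned.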

Which lands us back at the bullshit detector from E7, now with sharper teeth. When a team says "we did RL and the benchmark went up," the drift question — "how did you keep it from drifting?" — is no longer vague. You now know precisely what you're probing: Did you leash the model to its sensible self, and did you tune that leash well enough to get real gains without it quietly forgetting how to be a coherent model outside your test? A team with a crisp answer (a real baseline model, a tuned KL, evidence the model stayed general) is doing real work. A team that can't speak to it may have a model that dazzles on their one benchmark and falls apart the moment you take it off the leash — brittle exactly where it matters. The leash isn't a footnote of RL. It's the difference between a model that genuinely learned and one that just learned to game you without you noticing.

Notes  Notes & sources

The conceptual backbone above is evergreen. The boxed material below dates, and is fenced off deliberately.

STATE OF PLAY — June 2026
· No single "best" model: GPT-5-series, Claude Opus 4.6/4.7, Gemini 3.1 Pro,
  and DeepSeek (V3.2 / V4) each lead different slices — science reasoning,
  coding, agentic tasks, and price-performance respectively.
· RL on verifiable rewards (math, code) is the dominant frontier lever; open
  labs (DeepSeek, Qwen) reached the frontier largely via cheaper RL recipes
  (e.g. GRPO) rather than sheer scale.
· Reasoning models that "think" before answering (test-time compute) are now
  standard at the frontier, not a novelty.
Specific models/numbers will age fast; the mechanisms above will not.

Primary sources (canonical papers, verified via the Valency academic corpus)

  1. Sennrich, Haddow & Birch, Neural Machine Translation of Rare Words with Subword Units (2015), arXiv:1508.07909 — the BPE subword-tokenization scheme.
  2. Radford et al., Language Models are Unsupervised Multitask Learners (2019, the GPT-2 report) — byte-level BPE, which makes every input (including emoji and unseen symbols) representable.
  3. Mikolov et al., Efficient Estimation of Word Representations in Vector Space (2013), arXiv:1301.3781 — static word embeddings (word2vec) and the king−man+woman≈queen geometry. (A static-embedding result; LLM token embeddings use the same near-means-similar idea but are contextual — see source 4.)
  4. Vaswani et al., Attention Is All You Need (2017), arXiv:1706.03762 — the Transformer and self-attention.
  5. Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556 — the "Chinchilla" compute-optimal scaling result.
  6. Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (2022), arXiv:2203.02155 — InstructGPT / RLHF.
  7. Rafailov et al., Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (2023), arXiv:2305.18290 — DPO.
  8. Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022), arXiv:2212.08073 — Constitutional AI / RLAIF.
  9. Schulman et al., Proximal Policy Optimization Algorithms (2017), arXiv:1707.06347 — PPO.
  10. Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024), arXiv:2402.03300 — GRPO.

Sourced for the boxed/dramatic claims (per research-discipline rule on dramatic numbers)

  11. Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (2023), arXiv:2305.15425 — some languages fragment into several times more tokens than English.
  12. Epoch AI, Tracking frontier training compute & cost — estimate that frontier pre-training runs cost tens to hundreds of millions of dollars. (Estimate; figure moves over time.)
  13. Schulman et al., Trust Region Policy Optimization (2015), arXiv:1502.05477 — TRPO; the explicit trust-region constraint that PPO later simplified into a clip (E8).
  14. French, Catastrophic Forgetting in Connectionist Networks (1999), Trends in Cognitive Sciences 3(4):128–135 — the classic study of networks forgetting prior learning under new training. Cited in E9 as the cousin of RL “drift,” not an identity. (French names sequential-task interference; RL drift is reward-over-optimization pulling a model from coherent language. The KL-penalty fix is sourced to fn-6 and fn-10.)
  15. Hochreiter & Schmidhuber, Long Short-Term Memory (1997), Neural Computation 9(8):1735–1780, DOI:10.1162/neco.1997.9.8.1735 — the LSTM, the gated RNN that was the leading approach to sequence modeling before the Transformer (B1).
  16. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022), arXiv:2205.14135 — a hardware-aware reorganization of attention with identical mathematical results; the canonical example of architecture/hardware co-design (B5).
  17. Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023), arXiv:2312.00752 — representative of the sub-quadratic “after the Transformer” research frontier referenced in B6 and the State of Play box.
  18. Supporting context (B): Kaplan et al., Scaling Laws for Neural Language Models (2020), arXiv:2001.08361 — the earlier scaling-laws result underpinning “more compute → better models”; and Tay et al., Efficient Transformers: A Survey (2020), arXiv:2009.06732 — survey of efficient-attention approaches attacking the quadratic cost.
  19. Penedo et al., The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (2024), arXiv:2406.17557 — large-scale web-data curation; careful filtering/dedup of pre-training data measurably improves downstream model quality (C3).
  20. Kaplan et al., Scaling Laws for Neural Language Models (2020), arXiv:2001.08361 — the influential earlier scaling laws that leaned toward prioritizing model size; the view Chinchilla later corrected (contrast, C4).
  21. Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556 (“Chinchilla”). For a fixed compute budget, model size and training tokens should scale together; most prior large models were undertrained. Trained 400+ models; the 70B Chinchilla beat the 280B Gopher, 175B GPT-3, and 530B Megatron-Turing NLG (C4).
  22. Villalobos et al. (Epoch AI), Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data (2022, updated 2024), arXiv:2211.04325 — the canonical “data wall” analysis: a few hundred trillion tokens of high-quality public text, projected exhaustion in the second half of the 2020s (C4, C6).
  23. Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget (2023), arXiv:2305.17493 — the “model collapse” concern in the June-2026 State-of-play box: careless training on model-generated data can degrade quality.

Supporting: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020, arXiv:2005.14165); Wei et al., Chain-of-Thought Prompting Elicits Reasoning in LLMs (2022, arXiv:2201.11903).