An interactive primer

How LLMs Actually Work

Trace one message through the whole machine — then take every piece apart with your own hands. No math required; bring only curiosity.

18 min read · Interactive · Updated Jun 2026

Reader's contract. You are smart and curious but you are not an ML engineer, and you don't want to become one. You want to understand — well enough to look a founder or a researcher in the eye and know whether their claim holds water. This document leads with pictures and analogies, defines every piece of jargon the moment it appears, and never makes you read a wall of text when a diagram would do. Math stays in the basement; we'll only come upstairs for it when a number actually changes how you think.

◆

Trace one message through the whole stack

Watch the whole machine run once

Before we take anything apart, let's watch the whole machine run once, end to end, on a single real message. Everything in the rest of this document is just a zoom-in on one of these steps. Keep this picture in your head; we'll hang every later idea off it.

You type into a chat box:

"How many r's are in strawberry?"

You hit enter. To you it feels instant and obvious. To the model, your sentence is about to go through five transformations before a single word comes back. Here is the journey.

One message, six stages.

Everything later in this article is a zoom-in on exactly one of these boxes. The model never sees letters — only the chunks in stage 1.

STAGE 1

Tokenize

Split the sentence into chunks.

STAGE 2

Embed

Each chunk becomes a vector.

STAGE 3

Attention

Tokens read each other.

STAGE 4

Predict

Rank every possible next token.

STAGE 5

Sample

Pick one (temperature decides how boldly).

STAGE 6

Output, loop

Append it, run again — one token at a time.

Illustrative pipeline. Each stage gets its own section below.

Stage 1 — Tokenize. The model can't read letters. The first thing that happens is your sentence gets chopped into tokens — chunks of text, usually a word or a fragment of a word, that the model was taught to recognize as atomic units. "How", " many", " strawberry" might each be a token; longer or rarer words get split into pieces. Crucially, the model now sees chunks, not letters — remember this; it's the whole reason the strawberry question is hard for it.

Stage 2 — Embed. Each token is turned into a long list of numbers — a vector, which is just a coordinate that places the token somewhere in a vast "meaning-space." Tokens with similar meanings land near each other. The word is gone; a position in space has replaced it.

Stage 3 — Attention / Transformer. Now the model lets every token look at the tokens that came before it and decide which ones matter for understanding it. This is attention, and it's the engine. "strawberry" looks back at "r's" and "How many" to figure out what's being counted. This happens in stacked layers, each one refining the picture.

Stage 4 — Predict. After all that looking-around, the model produces one thing and one thing only: a giant ranked list of every possible next token, each with a probability. It is, at heart, the world's most sophisticated autocomplete. For our prompt, the top candidates might be "There", "The", "Straw…", each with a score.

Stage 5 — Sample. From that ranked list, the model samples — picks one token. How adventurously it picks is controlled by a dial called temperature (we'll play with it later). Pick the safe top choice, or roll the dice on a lower-ranked one.

Stage 6 — Output, then loop. The chosen token is shown to you, then appended to the input, and the whole pipeline runs again to pick the next token. And again. One token at a time, looping, until it produces a special "I'm done" token. That streaming you see in a chat window? That's this loop, live.

The punchline you should already feel: the model never "counts the r's." It pattern-matches its way to an answer, token by token, having never seen the individual letters in "strawberry" at all. That's not a bug in one model — it's a direct consequence of Stage 1. By the end of Section A you'll understand exactly why, and you'll never be fooled by a "the AI can't spell" headline again.

Now let's earn that understanding. We'll walk the same pipeline again — slowly, properly — as the life story of a model: how one is born, and then how it's used.

Section

The lifecycle of a modern LLM

Born, shaped, then used

Here's the spine of this whole section, the story arc we're about to tell. A model isn't programmed. It's grown, then shaped, then used. Eight stages, one continuous story. Everything up to the final stage happens once, in a data center, over months — the BIRTH of the model. The last stage, inference, happens every single time you send a message — the model's LIFE. Let's go.

A1 · Tokenization

Teaching the model an alphabet of its own

A model can't see text the way you do. Before anything else, we have to convert writing into numbers, and the very first decision is: what is the smallest unit the model is allowed to see?

The naive answer is "letters." It fails: spelling out everything letter by letter makes sequences impossibly long and throws away the obvious fact that "running" and "runner" share a root. The other naive answer is "whole words." That fails too: there are millions of words, names, and typos, and the model would be helpless the first time it met a word it had never seen.

The field's answer is a beautiful compromise called subword tokenization — most commonly a scheme named Byte-Pair Encoding (BPE).¹ The idea: start with small units, then repeatedly glue together the pairs that show up most often, until you've built a vocabulary of common chunks. Frequent words ("the", "many") become single tokens; rarer words get assembled from a few pieces ("tokenization" → "token" + "ization"). Modern models run BPE not on characters but on raw bytes (byte-level BPE, introduced with GPT-2²) — which is what guarantees that anything the model meets, even a word it's never seen, an emoji, or a stray symbol, can always be spelled out from smaller fragments as a last resort. Nothing is ever un-representable.
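To make the "glue the most frequent pair" loop concrete, here is a minimal sketch in Python. It works on characters rather than raw bytes, and the four-word corpus with its counts is made up for illustration, so treat it as a toy version of the idea and not as any real tokenizer's training code.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one glued symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = count
    return merged

def train_bpe(corpus_counts, num_merges):
    """Start from single characters and apply up to `num_merges` merges."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single chunk
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> how many times it appears.
toy_corpus = {"straw": 5, "berry": 6, "strawberry": 4, "bear": 2}
merges, segmented = train_bpe(toy_corpus, num_merges=8)
print(merges[0])        # the first pair that got glued together
print(list(segmented))  # how each word is chunked after the merges
```

In this toy corpus the first merge is "b" + "e", simply because that pair appears most often. Nothing about meaning is involved; the vocabulary is built from frequency alone.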

See it as the model sees it.

Type anything. Watch words land as a chunk or two — not as letters. That gap is why it miscounts r's.

TOKENS

characters: 0 · tokens: 0

Token splits are precomputed and illustrative; real tokenizers vary by model. IDs shown for "strawberry": straw = 15140, berry = 19772.

This one design choice has consequences that ripple through everything:

It's why models miscount letters. "Strawberry" arrives as a chunk or two, not as ten letters. Asking the model to count r's is like asking you to count the serifs in a word you read at a glance — the information was never in your conscious view. Founders who demo "our model can finally spell!" are usually just bolting a calculator-like tool onto the side; the core limitation is structural.
It's why tokens, not words, are the unit of pricing and context. When a lab says a model has a "200,000-token context window," that's tokens, not words — roughly 150,000 English words, but far fewer for code or other languages, where text fragments into more tokens.

Context window = what the model can see

A model can only answer from what's inside its window. Shrink it, and the earliest facts simply vanish.

CONVERSATION

Context window size: 6 turns

The fact never changed. The model just can't see it once it scrolls out of the window. This is why long chats 'forget' the start.

Illustrative. Real windows are 100k–1M+ tokens; the failure mode is identical, just further out.
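The "scrolls out of the window" effect can be sketched in a few lines. This is only one simple policy (keep the newest turns that fit), the whitespace word count is a crude stand-in for a real tokenizer, and the conversation is invented; real products may trim or summarize differently.

```python
def fit_to_window(turns, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns whose combined (approximate) token count fits the window."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk backward from the newest turn
        cost = count_tokens(turn)
        if used + cost > max_tokens:      # the next-oldest turn no longer fits
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

# Hypothetical chat history.
chat = [
    "My dog is named Biscuit.",
    "Nice name! What breed?",
    "A corgi. Anyway, what is the tallest mountain?",
    "Mount Everest.",
    "What is my dog's name?",
]
visible = fit_to_window(chat, max_tokens=20)
print(visible)
```

With a budget of 20 "tokens", the first turn is the one that gets dropped, so the name is no longer anywhere the model can see when the final question arrives.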

It's why some languages cost more. BPE vocabularies built mostly from English text fragment other scripts into many small tokens, so the same sentence in, say, Thai or Hindi can cost several times more tokens — and therefore more money and more of the context window — than in English.¹¹

So: we've turned writing into a stream of token-IDs. But an ID is just a name tag — a number like "15140" tells the model nothing about what "straw" means. That's the next problem.

A2 · Embeddings

Giving every token a place in meaning-space

A token-ID is arbitrary. We need to convert each one into something that actually carries meaning. The trick: represent every token as a long list of numbers — a vector — that you can think of as coordinates in a high-dimensional space of meaning. (High-dimensional just means "lots of coordinates" — hundreds or thousands per token, instead of the three we live in. Don't try to picture it literally; picturing 3-D and trusting the math is enough.)

12,288

coordinates in a single GPT-3 token vector. We flatten it to 2 below — the intuition survives.

The magic property: the model learns these coordinates so that tokens with similar meaning land near each other. "King" sits near "queen." "Paris" sits near "France." This learned vector is called an embedding. Directions in the space can even encode relationships — the famous result that king − man + woman ≈ queen.³ (One honest caveat: that clean piece of arithmetic comes from an earlier, static kind of word embedding — word2vec, 2013 — where each word has one fixed vector. The token embeddings inside an LLM use the same near-means-similar idea, but they don't stay fixed: the very next stage adjusts each one based on context. So treat king−queen as the intuition pump it is, not a literal operation happening inside GPT.)

Words become coordinates.

Related words land together. Watch the arithmetic: take king, apply the same step that turns man into woman, and you arrive at queen. The offset itself carries the meaning.

Illustrative; real embeddings have thousands of dimensions, flattened here to two. The king−man+woman result comes from static word2vec embeddings, not from inside an LLM.
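The same arithmetic can be written out by hand. The two-dimensional coordinates below are invented so the offsets line up exactly; real embeddings are learned, have hundreds or thousands of dimensions, and are usually compared with cosine similarity. Plain distance is used here only to keep the toy readable.

```python
import math

# Hypothetical 2-D coordinates: axis 0 is roughly "royalty", axis 1 roughly "masculine".
toy_embeddings = {
    "king": (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
    "apple": (0.3, 0.4),
}

def analogy(start, minus, plus):
    """Compute start - minus + plus, then return the nearest word that is not one of the inputs."""
    a, b, c = (toy_embeddings[w] for w in (start, minus, plus))
    target = tuple(x - y + z for x, y, z in zip(a, b, c))
    candidates = [w for w in toy_embeddings if w not in (start, minus, plus)]
    return min(candidates, key=lambda w: math.dist(toy_embeddings[w], target))

print(analogy("king", "man", "woman"))
```

Excluding the input words from the candidates mirrors how such analogies are normally scored; without that rule the nearest point is often just one of the words you started with.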

Why does this matter so much? Because once meaning is geometry, reasoning starts to look like arithmetic the machine can actually do. The model isn't shuffling words; it's moving points around in a space where "closer" means "more related." Every later stage operates on these vectors, never on the text.

But there's a gap. Right now each token's vector is fixed — "bank" has one location, whether you mean a riverbank or a savings bank. Meaning in real language depends on context. We need a mechanism that lets each token adjust itself based on its neighbors. That mechanism is the heart of the whole revolution.

A3 & A4 · The Transformer and Attention

Letting words read the room

This is the engine. It's worth slowing down, because if you understand this one idea, you understand why the last decade happened.

The problem the field was stuck on. Before 2017, the leading approach read text the way you'd read through a straw — one word at a time, left to right, trying to cram everything it had seen so far into a single running "memory." (These were called RNNs, recurrent neural networks — recurrent meaning they looped over the sequence step by step.) Two fatal flaws: they forgot the beginning of long passages by the time they reached the end, and because each step depended on the one before it, they couldn't be sped up by doing the work in parallel. Training was slow, and long-range understanding was poor.

The 2017 breakthrough — a paper bluntly titled Attention Is All You Need⁴ — threw out the straw entirely. Its architecture, the Transformer, lets the model look at many words at once and, for each word, decide which other words it should pay attention to. That's it. That's the idea. It's called self-attention: a token gets to ask other tokens "how relevant are you to me, right now?" and weight them accordingly. One crucial detail for the chat models you actually use: they read left-to-right and look only backward — each token can attend to the words that came before it, never the ones still to come. (It has to be this way: when the model is predicting the next word, the future words don't exist yet.) This is called causal (or masked) attention.

Here's the analogy that makes it click. Picture a dinner-party conversation. When someone says "it," they mentally check back over what's already been said: what does "it" refer to? — and the word "it" effectively turns up the volume on the noun it points back to, and turns down the irrelevant chatter. Each word builds its understanding by selectively listening to the room. Where the metaphor stops: in a chat model, a guest can only hear the people who spoke before them — nobody hears the future. That's the causal rule above, and it's why the model can generate one word at a time at all.

Hover a word. Watch it look backward.

Earlier words light up by how hard the hovered word attends to them. Forward words grey out — the model never peeks at the future.

Tap a word to lock it; hover to peek. Each word can only look backward.

Attention weights are precomputed and illustrative. Causal rule enforced: a word can only attend to words before it.
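For readers who want to see the causal rule as code, here is a stripped-down sketch. It assumes the token vectors themselves serve as queries, keys, and values; a real Transformer first passes them through learned projections and uses several attention heads. The input vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """Each position blends itself and earlier positions; later positions are never scored."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # the causal rule: positions 0..i only
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in visible]
        weights = softmax(scores)
        blended = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        outputs.append(blended)
        # Pad with zeros so each row shows "no attention" on future positions.
        all_weights.append(weights + [0.0] * (len(vectors) - i - 1))
    return outputs, all_weights

# Three hypothetical 2-D token vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one word's "volume knobs" over the sentence. The zeros to the right of the diagonal are the greyed-out future words from the demo above.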

Two design notes that pay off later:

It's done in parallel, which is why scale became possible. Because attention looks at all tokens at once instead of marching through them one by one, the math is mostly large matrix multiplications — exactly the operation that graphics chips (GPUs) do blisteringly fast in parallel. The Transformer didn't just understand better; it understood in a shape the hardware loved, and that unlocked training on a scale RNNs could never reach. Architecture and hardware clicked together, and the race was on.
It's stacked into layers. One attention step isn't enough. A Transformer stacks dozens of these "everyone listens to everyone" layers, each refining the representation. Early layers catch grammar and nearby relationships; deeper layers assemble meaning, then something we loosely call reasoning. A modern frontier model is just a very deep stack of this same move.

So now we have an architecture: a tall stack of attention layers that turn a sequence of token-vectors into a rich, context-aware understanding, and finally into a prediction of the next token. But an architecture is an empty engine. It knows nothing yet. We have to fill it with knowledge. That's training, and it comes in three escalating acts.

A5 · Pre-training

Reading the internet to learn the world

This is where a model gets its raw intelligence, and it's astonishingly simple to state: predict the next token, over and over, across a huge slice of human writing.

That's the entire objective. Show the model "The capital of France is ___" and have it guess; if it guesses wrong, nudge all those billions of internal numbers (called parameters — the knobs the model learns) a hair in the direction that would've been right. Do this trillions of times, over books, code, websites, and forums, and something remarkable happens: to get good at predicting text, the model is forced to learn the patterns behind the text — grammar, facts, a little arithmetic, the structure of an argument, the rhythm of a story. Understanding is a side effect of relentless autocomplete.

The pre-training loop.

No human grades anything. The internet IS the answer key — the next word is always sitting right there in the text.

Read text

A snippet from the internet: books, code, web.

→

Predict next token

Guess the word that comes next.

→

Check the answer

The real next word is right there — the internet IS the answer key.

→

Nudge the weights

Adjust slightly so the guess gets better. Repeat.

↺ × 1,000,000,000,000

Schematic of the self-supervised next-token objective. Input funnel: books + code + web + conversations.
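The loop above can be run at toy scale. The sketch below swaps the Transformer for the simplest possible model, a table of scores that looks only at the previous word, and trains it on one invented sentence. The read, predict, check, nudge cycle is the same in spirit; everything else is an assumption made for brevity.

```python
import math

text = "the cat sat on the mat the cat sat on the hat".split()
vocab = sorted(set(text))
index = {word: i for i, word in enumerate(vocab)}

# The "parameters": one row of scores per previous word, all starting at zero.
logits = [[0.0] * len(vocab) for _ in vocab]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Average surprise (cross-entropy) at the true next word; lower is better."""
    pairs = list(zip(text, text[1:]))
    return sum(-math.log(softmax(logits[index[a]])[index[b]]) for a, b in pairs) / len(pairs)

loss_before = average_loss()
learning_rate = 0.5
for _ in range(200):
    for prev, nxt in zip(text, text[1:]):           # the text itself is the answer key
        probs = softmax(logits[index[prev]])         # predict
        for j in range(len(vocab)):
            # Gradient of cross-entropy for a softmax: predicted minus actual.
            grad = probs[j] - (1.0 if j == index[nxt] else 0.0)
            logits[index[prev]][j] -= learning_rate * grad   # nudge
loss_after = average_loss()

best_after_cat = vocab[max(range(len(vocab)), key=lambda j: logits[index["cat"]][j])]
print(round(loss_before, 3), round(loss_after, 3), best_after_cat)
```

No one labeled anything here: the only supervision is the next word already sitting in the text, which is what "self-supervised" means in the caption above.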

$100M+

estimated cost of a single frontier pre-training run — months on tens of thousands of GPUs.

This stage is why models are so expensive and why only a handful of players do it: it eats months of time on tens of thousands of GPUs at an estimated cost of tens to hundreds of millions of dollars for a frontier run.¹² And it raises the central economic question of the field: given a fixed pile of money and compute, should you build a bigger model or feed it more data?

For a while everyone chased size — bigger model, bigger headlines. Then in 2022 a paper nicknamed Chinchilla showed the field had been doing it wrong: most big models were undertrained — too many parameters, too little data — and you'd get a smarter model for the same cost by making it smaller but feeding it far more text.⁵ The takeaway, now lore: data and model size must scale together. A model isn't "better" because it's bigger; it's better when its size and its training data are balanced for the compute you spent.

Bigger isn't smarter. Balanced is smarter.

For a fixed compute budget, error bottoms out where parameters and data are balanced — this chart quietly reset how every lab budgets a run.

After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations.
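If you do want one trip to the basement, the balance point can be estimated with two widely quoted rules of thumb, neither of which appears in this article: training compute is roughly 6 × parameters × tokens, and the Chinchilla-style balance is roughly 20 training tokens per parameter. Both constants are approximations, so the sketch below is a back-of-envelope calculator and nothing more.

```python
import math

# Assumptions (common rules of thumb, not exact laws):
FLOPS_PER_PARAM_PER_TOKEN = 6   # training compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20           # compute-balanced data is roughly D = 20 * N

def balanced_split(compute_flops):
    """Given a compute budget in FLOPs, return (parameters, tokens) under the two assumptions."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

params, tokens = balanced_split(5.88e23)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

The example budget was chosen so the result lands near the published Chinchilla model itself, about 70 billion parameters trained on about 1.4 trillion tokens. The practical use is the reverse question: if someone quotes a parameter count far above this line for their budget, the model was probably short on data.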

Why this matters for evaluation. When a startup brags about "a trillion-parameter model," the right question isn't "how big?" — it's "how much did you train it, and on what?" Parameter count alone is a vanity metric. Data quality and quantity are where models are actually won or lost.

At the end of pre-training you have a base model: a sprawling, knowledgeable, deeply weird text-predictor. It is not a helpful assistant. Ask it a question and it might continue with five more questions, because on the internet, questions are often followed by more questions. It has knowledge but no manners, no sense that it's supposed to help you. Fixing that is the next two acts.

A6 & A7 · Fine-tuning and post-training

Turning a know-it-all into an assistant

A base model is a brilliant, feral library that talks like the average of the internet. Post-training is the finishing school that turns it into the polite, helpful "assistant" you actually chat with. It's where a model gets its personality and its alignment — and it happens in steps.

Step one: Supervised fine-tuning (SFT) — show, don't tell. We collect a pile of high-quality example conversations — a human writes an ideal answer to a prompt — and we fine-tune the base model on them. Fine-tuning just means more training, but now on a small, curated set instead of the raw internet. The model learns the format of being helpful: when you ask a question, you answer it; you don't ramble; you follow instructions. This is imitation — the model copies good examples.

Step two: learning from preferences — rank, don't script. Imitation has a ceiling: humans can't hand-write an ideal answer to every possible prompt, and "good" is often a matter of taste and degree. So we switch from showing to judging. We have the model produce two answers, and a human (or another model) says "this one's better." Do this across mountains of comparisons, and you can teach the model to produce answers humans prefer — more helpful, more honest, less likely to confidently make things up.

The landmark here is InstructGPT / RLHF — Reinforcement Learning from Human Feedback.⁶ The recipe: use all those human preference judgments to train a reward model (a model that scores how good an answer is), then use reinforcement learning to push the assistant toward higher-scoring answers. RLHF is the single biggest reason ChatGPT felt like a leap over raw GPT-3: same underlying knowledge, radically better behavior. (The full machinery of how RL actually works is the hard part — that's exactly what the RL section below is for.)
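The reward-model half of that recipe is small enough to sketch. The standard training signal is a pairwise loss: the reward model is penalized whenever it scores the human-preferred answer below the rejected one. In the sketch, the two "features" per answer and the three comparisons are invented, and the reward model is a two-number linear scorer instead of a neural network. The reinforcement-learning step that follows is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Toy reward model: a weighted sum of hand-made features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    """Pairwise loss: small when the preferred answer out-scores the rejected one."""
    return -math.log(sigmoid(reward(weights, preferred) - reward(weights, rejected)))

# Hypothetical features per answer: (answers_the_question, rambles).
comparisons = [
    ((1.0, 0.0), (0.0, 1.0)),   # (preferred, rejected)
    ((1.0, 0.0), (1.0, 1.0)),
    ((1.0, 1.0), (0.0, 1.0)),
]

weights = [0.0, 0.0]
for _ in range(100):
    for preferred, rejected in comparisons:
        margin = reward(weights, preferred) - reward(weights, rejected)
        step = 1.0 - sigmoid(margin)   # shrinks as the model grows more confident
        for i in range(len(weights)):
            # Gradient descent on the pairwise loss.
            weights[i] += 0.1 * step * (preferred[i] - rejected[i])

print([round(w, 2) for w in weights])
```

After training, the first weight is positive and the second negative: from nothing but "this one's better" judgments, the scorer has learned that answering is good and rambling is bad. It has learned what raters prefer, which is not automatically what is true.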

From feral library to assistant.

Same brain the whole time — we're not adding knowledge, we're shaping behavior. Helpfulness rises left to right.

PANEL 1

Base model

Spouts raw internet text — knowledgeable, but unhelpful and unsteerable.

→

PANEL 2 · + SFT

Imitation

Answers in a clean assistant format — learns how to respond by copying good examples.

→

PANEL 3 · + RLHF

Judgment

Learns which responses humans actually prefer — the step where it grows up.

Schematic escalation. Same model throughout — each stage shapes behavior, not knowledge.

A few things worth internalizing, because they're where evaluation gets sharp:

Post-training is why two models with similar raw intelligence can feel wildly different. Tone, refusal style, how it handles ambiguity, whether it pushes back — that's almost all post-training. When people say a model "has good vibes," they're describing post-training.
There's more than one recipe now. RLHF is powerful but fiddly. Newer methods like DPO (Direct Preference Optimization) skip the separate reward model and tune the model directly on the "this one's better" pairs — simpler and cheaper, often nearly as good.⁷ And Constitutional AI replaces some human feedback with the model critiquing itself against a written set of principles, so the labeling scales without armies of human raters.⁸ When a lab describes its "secret sauce," it's usually a particular blend of these post-training moves.
This is also where the limits live. Post-training can make a model sound aligned without making it be reliable. A model can be trained to give answers humans rate highly — which is not the same as answers that are true. (When the model produces a confident, fluent falsehood, that's a hallucination — and post-training can accidentally reward exactly the smooth confidence that produces them.) Hold that thought; it's the dark side the RL section explains.

The model is now born and raised: knowledgeable from pre-training, helpful from post-training. It sits frozen, finished, weighing in at billions of parameters. Now — finally — someone sends it a message. That's the last stage, and it's the only one that happens every single time you hit enter.

A8 · Inference

The model, in use, one token at a time

Inference is the model running — taking your prompt and generating a reply. This is the loop from the on-ramp, and now you have the full picture of what's happening inside each step. Your message gets tokenized, embedded, and pushed up through the whole stack of attention layers, which produces a ranked list of likely next tokens. One is chosen. It's appended to the conversation. The whole thing runs again for the next token. And again. Word by word, which is exactly why replies stream onto your screen rather than appearing all at once.

Generation is one frozen loop.

No learning happens here — the parameters are locked. Rank the next token, pick one, append, repeat.

Prompt → tokens

Text is split into tokens.

→

Tokens → vectors

Each token becomes a list of numbers.

→

Up the attention stack

Layers let tokens read each other for context.

→

Rank next tokens

Out comes a ranked list of likely next tokens.

→

5 🔒

Pick one, append

Choose one, add it, run the whole thing again.

↺ next token · this happens once per token, ~dozens of times per sentence

🔒 Parameters are frozen — no learning happens here; it's pure read-out.
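The loop itself is short enough to write down. In this sketch the frozen model is replaced by a made-up lookup table that only looks at the most recent token, and the pick is always the top-ranked candidate; a real model reads the entire context and usually samples. The table entries and the "<end>" marker are assumptions for illustration.

```python
END = "<end>"

# Hypothetical stand-in for the frozen model's ranked list of next tokens.
TOY_TABLE = {
    "?": {"There": 0.6, "The": 0.4},
    "There": {"are": 0.9, "is": 0.1},
    "are": {"three": 0.7, "two": 0.3},
    "three": {".": 0.8, "r's": 0.2},
    ".": {END: 1.0},
}

def next_token_probs(tokens):
    """Where tokenizing, embedding, and the attention stack would run in a real model."""
    return TOY_TABLE[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)      # rank the next tokens
        choice = max(probs, key=probs.get)    # pick one (here: always the top choice)
        if choice == END:                     # the special "I'm done" token
            break
        tokens.append(choice)                 # append, then run the whole thing again
    return tokens

reply = generate(["How many r's are in strawberry", "?"])
print(reply[2:])
```

Nothing inside `TOY_TABLE` changes while this runs, which is the point of the lock icon: generation reads the parameters and never writes to them.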

Birth vs. life. Everything before inference — months, once, in a data center. Inference itself — a fraction of a second, every message, for every user on Earth.

Two things make inference the part of the lifecycle that businesses obsess over:

It's the recurring cost. Pre-training is a giant one-time bill. Inference is the meter that runs forever — every message from every user re-runs that whole stack. This is why "tokens per dollar" and clever tricks to make inference cheaper (we'll meet the KV cache and others later) are where a huge amount of real engineering money goes. A startup's margins often live or die here.
It's where you, the user, get a dial. Remember Stage 5 from the on-ramp — sampling. The model hands back probabilities; how you pick from them is a choice. Pick the single most-likely token every time and you get safe, repetitive, slightly robotic text. Allow some randomness and you get creativity — and, past a point, nonsense. That dial is temperature, and it's worth feeling with your own hands.

Temperature = creativity dial

Same probabilities, different boldness. Temperature is the user's dial between safe and creative.

FIXED PROMPT

The weather today is ___

Temperature: 0.8

Illustrative distribution; real vocabularies are ~100k tokens. Softmax-with-temperature, precomputed.
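The dial is one line of arithmetic: divide every score by the temperature before converting scores to probabilities. The candidate words and their scores below are invented; only the softmax-with-temperature step is the real mechanism.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["sunny", "cloudy", "rainy", "purple"]
logits = [3.0, 2.0, 1.0, -1.0]   # hypothetical raw scores from the model

cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 0.8)
hot = softmax_with_temperature(logits, 2.0)

rng = random.Random(0)           # fixed seed so the example is repeatable
pick = rng.choices(candidates, weights=warm, k=1)[0]

for name, dist in (("0.2", cold), ("0.8", warm), ("2.0", hot)):
    print(name, [round(p, 3) for p in dist])
print("sampled at 0.8:", pick)
```

At 0.2 nearly all the probability sits on the top word; at 2.0 even "purple" gets a real chance. A temperature of exactly 0 would divide by zero in this formula, so it is usually treated as "always take the top choice".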

That's the full lifecycle: an empty architecture, filled with world-knowledge by pre-training, taught manners by post-training, and finally run, token by token, every time you ask it something. One story, eight stages, start to finish.

But that story took one thing for granted at every step: the architecture underneath it. We kept saying "the model" as if its shape were obvious — yet a decade ago almost nobody would have built it this way. So before we go any deeper, it's worth asking the question that quietly explains the whole modern era: why this design, and not one of the dozens that came before it? The answer isn't really about language. It's about the machines we feed.

Section

Why Transformers won

The idea that fit the machine

We just watched attention work: every word turns up the volume on the words that matter to it and tunes out the rest. That's the what. But a clever idea isn't enough to flip an entire industry — plenty of clever ideas die in a drawer. So the real question, the one that explains why the 2020s look the way they do, is: why did this architecture beat everything that came before, so completely that the whole field abandoned the old way almost overnight?

The short answer is going to surprise you. Transformers didn't just win because they understood language better. They won because they were shaped exactly right to be fed by the machines we happen to build — and that single fact is the thread that ties the math of the model to the dollar cost of running it. Let's earn that.

B1 · The old way — reading through a straw

One word at a time, memory smearing as you go

To feel why attention was a breakthrough, you have to feel the pain it cured. So rewind to before 2017.

The leading approach to language back then was the RNN — a recurrent neural network, where "recurrent" just means it loops over the text one piece at a time. (You met this briefly in Section A; now we open it up.) Picture reading a sentence through a straw: you can see exactly one word, you read it, you update a little running summary in your head — a single mental "state" meant to hold everything important so far — then you slide the straw to the next word and repeat. The model never sees the whole sentence laid out; it sees a parade of single words and a memory it keeps rewriting.

The most famous version, the LSTM (Long Short-Term Memory), was a genuinely brilliant patch on this idea: it added little gates that decided what to keep in memory and what to forget, so the running summary wouldn't get instantly overwritten.¹⁵ For years, LSTMs and their variants were the leading approach to language. (There were also CNNs — convolutional networks borrowed from image processing — used on text by sliding a small window across the words; faster than RNNs, but they only ever looked through a fixed-size window, so distant words still couldn't easily talk to each other.) The straw got better. It was still a straw.

And the straw had two flaws that no amount of cleverness could fully fix:

Flaw one — the memory fades. Everything the model knows about the sentence has to be squeezed, at every step, into that one running summary. By the time it reaches the end of a long paragraph, the beginning has been overwritten dozens of times — diluted, smeared, half-forgotten. This is the long-range dependency problem: in "The strawberry that I picked from the garden behind my grandmother's old house last summer was ripe," the word "was" needs to connect back to "strawberry," but thirteen words of memory-rewriting sit in between. The signal has to survive a game of telephone. Often it doesn't.

Flaw two — and this is the one that actually decided the war — you cannot do the work in parallel. Because each step depends on the running summary produced by the step before it, the model must process word 1, then word 2, then word 3, strictly in order. Word 50 cannot be computed until words 1 through 49 are done. There's no skipping ahead, no splitting the labor.

Reading through a straw

Each step has to wait for the one before it — and the earliest words fade as the running summary is rewritten over and over.

The → state → 🔒 wait → straw → state → 🔒 wait → berry → state → 🔒 wait → … → last → state → 🔒 wait → was

Faded boxes = words already smeared into the running summary. The 🔒 between every step is the bottleneck: step N cannot start until step N−1 finishes.

Schematic of the recurrent left-to-right chain. Fading is illustrative of the long-range dependency problem.
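Both flaws show up even in a cartoon version of the straw. The sketch below assumes the running summary is a single number that gets blended with each new word. Real RNNs and LSTMs use learned vectors and gates, so the numbers mean nothing beyond the shape of the decay.

```python
def run_rnn(token_values, keep=0.7):
    """Read values strictly in order, blending each one into a single running summary."""
    state = 0.0
    for value in token_values:   # step N cannot start until step N-1 has produced `state`
        state = keep * state + (1 - keep) * value
    return state

def share_of_first_word(length, keep=0.7):
    """How much of the final summary still comes from word 1 after `length` words."""
    return (1 - keep) * keep ** (length - 1)

# One "loud" first word followed by silence: what survives to the end?
final_state = run_rnn([1.0] + [0.0] * 14)
print(round(final_state, 5))
print(round(share_of_first_word(3), 5), round(share_of_first_word(15), 5))
```

After fifteen words, well under one percent of the first word is left in the summary, and there was no way to compute the last state without walking through every state before it.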

Hold onto that second flaw. It sounds like a mere engineering annoyance — so what if it's a little slow? But it's the hinge the whole story turns on, and here's why: the thing that makes modern AI work is scale — throwing enormous amounts of computation at enormous amounts of data. An architecture that forces you to do everything in single file can never absorb that much computation, no matter how much you're willing to spend. The straw had a speed limit baked into its shape.

B2 · Self-attention's escape — every word talks to every word, at once

Throwing away the loop

Now the move that broke it open. The 2017 paper Attention Is All You Need⁴ did something that, in hindsight, looks almost reckless: it threw away the recurrence entirely. No more loop. No more running summary passed hand-to-hand down the line. No straw.

Instead — self-attention, the mechanism from Section A: every token looks at every other token directly, in a single step, and decides how much each one matters to it. (Recall the dinner-party guest turning up the volume on the words that are relevant to "it.") And here's the part that matters for this section, the part to really sit with:

Distance stops mattering. In the straw, connecting "strawberry" to "was" more than a dozen words later meant the signal had to survive more than a dozen rounds of memory-rewriting. In attention, "was" looks straight back at "strawberry" in one hop — the same single step it uses to look at its immediate neighbor. A word a hundred tokens away and a word right next door are exactly the same distance to attention: one direct link. The game of telephone is gone. The fading-memory problem isn't patched — it's structurally deleted.

Same sentence, two architectures

The RNN walks the hallway one door at a time. Attention is in the room with everyone at once — and a far word is no harder to reach than a near one.

RNN (sequential)

long-range link = survive every step in between

straw → that → I → … → summer → was

↪ first → last: the signal has to snake through every intermediate state to arrive.

time: 6 steps, in order

Self-attention (parallel)

long-range link = ONE direct hop, same as a near one

↔ first ↔ last: one bold link, the same single step used for any pair.

time: all links at once

Illustrative. The bold clay link marks the one long-range pair to watch; the faint web is the full all-to-all pattern.

That alone would make attention better at understanding. But better-at-understanding is not what wins an industry. The thing that won the industry is hiding in the words "in a single step."

B3 · Parallelism — the unlock that made scale possible

A shape the hardware was starving to run

Here is the quiet revolution, and it's worth stating as plainly as possible because almost everything downstream depends on it.

Because attention looks at all the words at once instead of marching through them one-after-another, there is no longer a step that has to wait for the step before it. When the whole text is already on the page, as it is throughout training, word 50's attention can be computed at the very same instant as word 1's. The strict single-file ordering — the thing that throttled RNNs — is gone. The work can be spread out and done simultaneously.

Why does that change everything? Because of the hardware. A GPU (graphics processing unit — the chip originally built to draw video-game frames) is, at its core, a machine for doing thousands of small calculations at the same time. It is gloriously, ridiculously parallel. Feed a GPU a task that must be done in strict order, and most of those thousands of little workers sit idle, twiddling their thumbs, waiting their turn — which is exactly what an RNN does to a GPU. Feed it a task where everything can happen at once, and every worker lights up together.

Self-attention is the second kind of task. The Transformer didn't just understand language in a new way — it understood it in a shape the hardware was already starving to run.

The same chip, asleep or awake

The only difference is whether the math lets you use the whole chip at once. This is the whole ballgame.

RNN on a GPU

throughput: ~idle

The chip is mostly asleep; every core waits its turn while one bright worker steps across the grid.

Transformer on a GPU

throughput: maxed

Every core busy at the same time — the all-at-once math lights the entire grid.

Schematic of GPU core utilization. One clay core marks the lone active worker the RNN can keep busy.

This is the unlock. Once training could run in parallel across the whole sequence, you could throw vastly more computation at the problem in the same wall-clock time — which meant you could train on vastly more text, with vastly bigger models. And the central lesson of the modern era, the one we'll keep returning to, is brutally simple:

The architecture that could eat the most compute won. It wasn't necessarily the cleverest design in some abstract sense — it was the one that turned "spend more money on chips" directly into "get a smarter model." RNNs choked on scale; Transformers feasted on it. And the timing was perfect: scaling-laws research was just then showing that model capability climbs predictably as you pour in more compute⁵ — so the architecture that could actually absorb that compute, by running in parallel,⁴ was the one positioned to win. Once that became clear, the entire field pivoted, and it pivoted fast.

But I've been hand-waving with the word "compute." What, exactly, is the GPU doing thousands of times at once? Answer that, and the bridge from the model's math to its dollar cost falls right into place.

B4 · Matmuls — the bridge between the math and the money

A tower of giant multiplication tables

Here is the single most useful thing a non-engineer can understand about how these models actually run, and it fits in one sentence: underneath all the talk of attention and layers, a Transformer is, almost entirely, a tower of giant multiplication tables.

The technical name is matrix multiplication, "matmul" for short. A matrix is just a grid of numbers; multiplying two of them means doing a huge batch of "multiply these, add them up" operations to produce a new grid. You don't need the procedure. You need this: attention is matmuls, and the feed-forward layers between attention steps are matmuls. When the model decides how much "was" should attend to "strawberry," that's a matmul. When it pushes each word's vector through a layer to refine its meaning, that's a matmul. Stack the dozens to a hundred-plus layers of a large model, and running it once is billions upon billions of these multiply-and-add operations, and almost nothing else.
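For the curious, this is what "multiply these, add them up" looks like written out by hand. It is a minimal sketch in plain Python with tiny made-up grids. Real systems hand this to heavily optimized GPU libraries rather than Python loops, and the `ops` counter exists only to make the workload visible.

```python
def matmul(a, b):
    """Multiply grid a (rows x inner) by grid b (inner x cols).

    Returns the new grid and the number of multiply-add operations performed.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    ops = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]  # one multiply-add
                ops += 1
    return out, ops


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, ops = matmul(a, b)
print(product, ops)
```

Notice that every cell of the output grid is computed independently of every other cell. That independence is what lets thousands of GPU cores each take a cell and work at once.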

Now watch the whole thread snap together, because this one fact is load-bearing for half the rest of this document:

Matmuls are the bridge: MATH → HARDWARE → COST

The same fact that makes Transformers scale beautifully is the fact that makes running them cost what it does. Remember this chain — it explains both the magic and the bill.

Link 1 · The model

A stack of Transformer layers — zoom into any one and it's just grid × grid. The model is mostly matmuls.

perfect fit

Link 2 · The hardware

A GPU is a machine built to do matmuls — thousands of multiply-adds at once. Key into a lock.

every token

Link 3 · The money

Every token you generate = another pile of matmuls = real electricity and chip-time.

MATH → HARDWARE → COST

Schematic. The grids and cost bars are illustrative; the chain — and its two-way reading — is the point.

Going left-to-right, matmuls explain the magic. A GPU is, almost literally, a purpose-built matmul machine. So an architecture made of matmuls maps onto a GPU like a key into a lock — every one of those thousands of parallel cores can be busy doing its little multiply-and-add. This is the deep reason Transformers scale so well: you can keep buying more GPUs, and because the work is matmuls all the way down, the model just keeps absorbing the extra muscle. The architecture and the hardware were made for each other.
Going right-to-left, the same matmuls explain the bill. Because every single token the model reads or writes triggers another full pass of these billions of multiply-adds, compute is the product. This is why generating a long answer costs more than a short one; why a bigger model costs more per word than a smaller one (more layers = more matmuls per token); why companies obsess over chips and electricity. When a later section talks about inference cost, "tokens per dollar," or why a startup's margins are thin — it's all downstream of this. Every token is more matmuls, and matmuls cost money.
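To put rough numbers on "every token is more matmuls," here is a back-of-envelope sketch. It assumes a common rule of thumb that is not stated in this article: running a model forward costs about two operations (one multiply, one add) per parameter per token. The model sizes are hypothetical, and the estimate ignores the extra attention cost of very long inputs.

```python
# Assumed rule of thumb: about one multiply and one add per weight, per token.
OPS_PER_PARAM_PER_TOKEN = 2


def forward_ops(num_params, num_tokens):
    """Rough count of arithmetic operations to process num_tokens tokens."""
    return OPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


small_model = 7_000_000_000    # hypothetical 7-billion-parameter model
big_model = 70_000_000_000     # hypothetical 70-billion-parameter model

short_small = forward_ops(small_model, 100)
short_big = forward_ops(big_model, 100)
long_small = forward_ops(small_model, 1000)

print(short_small, short_big, long_small)
```

Under these assumptions, a model ten times bigger costs ten times more per token, and an answer ten times longer costs ten times more on the same model. Both bills come from the same place.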

Sit with that, because it's the rare idea that pays off in two directions at once. The matmul is simultaneously why the technology works and why it's expensive — the bridge between the math on the whiteboard and the line item on the invoice. Most people understand one side or the other. You now hold both ends of the same thread.

B5 · Hardware as the hidden hand — progress is gated by what chips do well

The silicon quietly selects the winner

Step back from Transformers specifically, because there's a bigger lesson here that will make you sharper about every future "breakthrough" claim you hear.

The instinctive story of progress is: someone has a brilliant idea, and the idea wins because it's brilliant. The truer story, especially in AI, is messier and more interesting: an idea wins when it's brilliant and it happens to fit the hardware we can build cheaply at scale. Progress is gated as much by what silicon does efficiently as by what's clever on paper.

Transformers are the cleanest example in history. The attention mechanism wasn't conjured from nothing in 2017; pieces of it existed earlier. What changed is that someone built an architecture that was all attention and all matmul, with the recurrence stripped out, and that turned out to be the shape that let the GPUs we were already mass-producing run flat-out. An equally clever architecture that didn't fit the hardware (one that demanded, say, lots of strict step-by-step ordering, or some operation GPUs are bad at) would very likely have lost, no matter how elegant. We can't fully know, because the hardware-friendly idea is the one that got to eat all the compute and therefore got all the investment, all the engineering, all the refinement. The hardware doesn't just run the winning idea. It quietly selects which idea gets to win.

Why this matters for evaluation. When a founder pitches "a fundamentally new architecture that beats Transformers," the sharp follow-up isn't "is it clever?" — it's "does it map onto the hardware people actually own?" A design that's smarter on paper but fights the GPU (or can't ride the same massive supply chain of chips) starts the race with a boulder on its back. Many promising "Transformer killers" have stalled for exactly this reason: not because the math was wrong, but because the silicon wasn't on their side. Cleverness is necessary. Hardware-fit is what's decisive.

This is also the lens for understanding why so much frontier effort goes into co-design — tweaking the architecture and the chips toward each other. FlashAttention, for instance, didn't change what attention computes at all; it just reorganized how the computation moves data around inside the GPU's memory so the chip stops waiting around — and that single hardware-aware rewrite made attention dramatically faster and cheaper, with identical results.¹⁶ The lesson repeats at every level: in this field, knowing the hardware is knowing the algorithm.
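Here is a small sketch of the "same answer, different route" idea. It computes one token's attention output twice: once all in one go, and once in small blocks while carrying a running total (a trick often called online softmax, which is the kind of blockwise bookkeeping FlashAttention builds on). This is a plain-Python illustration with made-up numbers. It does not model GPU memory at all, and the function names are hypothetical.

```python
import math


def attention_row(scores, values):
    """Softmax-weighted average of values, computed over the whole row at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_blockwise(scores, values, block=2):
    """Same result, but visiting the row in small blocks with running totals."""
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what we have so far to the new maximum (exp(-inf) is 0.0).
        scale = math.exp(running_max - new_max)
        running_sum = running_sum * scale + sum(math.exp(s - new_max) for s in s_blk)
        running_out = running_out * scale + sum(
            math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk)
        )
        running_max = new_max
    return running_out / running_sum


scores = [1.0, 2.0, 3.0, 0.5, 4.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
full = attention_row(scores, values)
blocked = attention_row_blockwise(scores, values)
print(full, blocked)
```

The two results agree up to floating-point rounding. The blockwise version never needs the whole row in hand at once, which is what makes it friendly to a chip with a small amount of fast memory.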

B6 · The catch — attention's cost grows with the square of the input

Double the text, quadruple the work

Everything above is the triumph. Now the limitation, because no honest account of why Transformers won can skip the price tag they carry — and this particular catch is the seed of half the open problems in the field.

Go back to the thing that made attention magical: every token looks at every other token. That's a wonderful property for understanding. It's a punishing property for cost, and the reason is just counting. If you have 10 tokens and each must look at all the others, that's roughly 10 × 10 = 100 little comparisons. Fine. But double the input to 20 tokens, and it's 20 × 20 = 400 — you doubled the text but quadrupled the work. Go to 100 tokens and it's 10,000 comparisons; 1,000 tokens, a million. The work grows with the square of the length, not in step with it. Engineers call this O(n²) — "order n-squared" — and it just means: as the input gets longer, the cost balloons far faster than the input itself.
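The counting above takes only a few lines to reproduce. This is a minimal sketch that counts token-pair comparisons and nothing else; the function names are made up for the illustration.

```python
def attention_pairs(n_tokens):
    """Every token is compared with every token: n * n comparisons."""
    return n_tokens * n_tokens


def growth_when_doubled(n_tokens):
    """How many times more work you do when the input doubles in length."""
    return attention_pairs(2 * n_tokens) / attention_pairs(n_tokens)


for n in (10, 20, 100, 1000):
    print(n, attention_pairs(n))

print(growth_when_doubled(10))
```

Whatever length you start from, doubling it multiplies the comparisons by four. That constant factor of four is the signature of a square.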

The N×N square — root of the context window

Step the input up and watch the grid of token-pair comparisons explode. This single square is one of the hottest research frontiers in the field.

N=4 → 16 cells

N=16 → 256 cells

Sequence length N: 4 tokens

At N=4 the grid is a tiny 16 cells. Drag right: the text grows in a line, the work grows in a square.

Illustrative. Cost curve is n² normalized against a linear reference; the clay marker tracks the slider's N.

This one fact has consequences you've felt without knowing the cause:

It's the root of the context window. A model's context window — the maximum amount of text it can consider at once (its prompt plus the conversation so far) — isn't an arbitrary number a lab picked. It's a budget. Because attention cost grows with the square of the length, letting the model "remember" twice as much text can cost roughly four times as much compute and memory. Long context is expensive precisely because of this square. When a lab announces "we doubled the context window," they're really announcing they paid (or engineered their way around) a quadratic bill.
It's why "just feed it the whole book / codebase / database" isn't free. People intuitively want to dump everything into the prompt and let the model sort it out. The square is the reason that gets costly fast, and why a whole ecosystem of workarounds exists — retrieval (fetch only the relevant pages instead of all of them), summarization, and a parade of "efficient attention" schemes that try to approximate all-tokens-look-at-all-tokens without paying the full n² price. None of these are free lunches; they trade some of attention's purity for affordability.
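To make the trade-off concrete, here is a sketch of one of the simplest efficient-attention ideas, a sliding window: each token looks only at itself and a fixed number of tokens just before it. The window size of 100 is an arbitrary choice for illustration, and real schemes differ in their details.

```python
def full_pairs(n):
    """All-to-all attention: every token compared with every token."""
    return n * n


def sliding_window_pairs(n, window):
    """Each token looks at itself and up to (window - 1) tokens before it."""
    return sum(min(i + 1, window) for i in range(n))


def directly_linked(i, j, window):
    """Can token i look straight at the earlier token j under this scheme?"""
    return 0 <= i - j < window


window = 100  # hypothetical window size
for n in (1000, 2000):
    print(n, full_pairs(n), sliding_window_pairs(n, window))

print(directly_linked(999, 950, window), directly_linked(999, 0, window))
```

Doubling the input roughly doubles the windowed work instead of quadrupling it. The price shows up in the last line: the final token can no longer look directly at the first one, which is precisely the long-range link that made attention attractive.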

Why this matters for evaluation, and where it points next. The O(n²) wall is one of the most active research frontiers in all of AI — entire approaches (linear-attention variants, state-space models, sparse and sliding-window attention, and more) exist primarily to dodge it. So when a startup claims a "1-million-token context" or a "Transformer-killer that scales linearly," you now know the real question: what did they give up to beat the square? Sometimes the answer is a genuine, clever win. Often it's an approximation that quietly degrades the very long-range understanding that made attention worth having in the first place. The quadratic cost is the tension every long-context and next-architecture claim is wrestling with — and it's exactly the thread we pick up in the sections on context windows, inference cost, and what might come after the Transformer.

So that's why Transformers won: they deleted the fading-memory problem with direct all-to-all attention, they broke the single-file bottleneck so training could finally run in parallel, and — the decisive part — that parallel, matmul-shaped work fit the GPUs we could mass-produce, turning money straight into intelligence. They won the way most technologies actually win: not purely on elegance, but on the marriage of a good idea to the machine that could run it. And the same all-to-all attention that made them brilliant is the n² catch that now defines their frontier. Keep that square in your back pocket — it's about to explain a lot.

State of play — fenced off because it dates. The mechanisms above will not.

STATE OF PLAY — June 2026
· The Transformer still anchors essentially every frontier model — GPT-5-series,
  Claude Opus 4.6/4.7, Gemini 3.1 Pro, DeepSeek (V3.2/V4). No challenger
  architecture has displaced it at the top, though hybrids are creeping in.
· The O(n²) attention cost remains the central scaling tension. Production
  long-context models lean on FlashAttention-style hardware-aware kernels plus
  sparse / sliding-window / linear-attention tricks and retrieval to fake very
  long context affordably — not pure all-to-all attention at length.
· State-space models (the Mamba line) and other sub-quadratic designs are the
  most-watched "after the Transformer" candidates, increasingly shipped as
  HYBRIDS (a few attention layers + many cheap layers) rather than full
  replacements. The reason is the lesson in B5: hardware-fit decides, and the
  GPU ecosystem is built around matmul-heavy attention.
Specific models/numbers will age fast; the mechanisms above will not.

Section

Pre-training

Back in Section A we walked the whole lifecycle and gave each stage a single sentence. Pre-training got this one: "reading the internet to learn the world." We said it's where a model gets its raw intelligence, that it's the most expensive stage, and that it ends with a knowledgeable-but-feral base model that isn't yet an assistant. We even met Chinchilla in passing.

This is the zoom-in. By the end of it you'll understand exactly what the model is doing during those months in the data center, why it produces real knowledge out of a game that sounds almost too dumb to work, and where the whole approach hits a wall the field is openly worried about. Keep the lifecycle diagram from Section A in your head — we're standing on the box labeled Pre-training, and everything here is what happens inside it before the model ever gets its manners.

You are here.

Pre-tokenize → Pre-training → Post-training → Reasoning / RL → Serving

We zoom into the one box that takes months, tens of thousands of GPUs, and most of the money — and produces a brilliant, useless genius.

One stupidly simple game, played a trillion times

Here is the entire objective of pre-training, with nothing left out:

Show the model some real text, with the next chunk hidden. Let it guess the hidden chunk. Check the guess against the truth. Nudge the model's internal numbers a hair toward what would have been right. Repeat — across roughly the whole readable internet.

That's it. That's the game. It's called next-token prediction, and it is the only thing happening in pre-training. (Recall from the on-ramp that a token is a chunk of text — usually a word or word-fragment — and that the model, at heart, is "the world's most sophisticated autocomplete." Pre-training is where that autocomplete is built.)

The reason this is so beautiful, and the reason the whole field rests on it, is that the answer key is free. Every sentence ever written is its own worked example: the next word is sitting right there in the text. Nobody has to label anything. You don't pay humans to say that the correct continuation of "The capital of France is ___" is "Paris." The text already told you. This is why pre-training can run on trillions of tokens: the supervision is baked into the data itself. (The jargon for this is self-supervised learning: the data supervises the model, no human grader required.)
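Here is the "free answer key" as a few lines of Python. It is a minimal sketch: splitting on spaces is a crude stand-in for a real tokenizer, and `next_token_examples` is a name invented for this illustration.

```python
def next_token_examples(tokens):
    """Turn one token sequence into (context, answer) pairs, with no human labels."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Splitting on spaces is a crude stand-in for real tokenization.
tokens = "The capital of France is Paris".split()
examples = next_token_examples(tokens)

for context, answer in examples:
    print(" ".join(context), "->", answer)
```

One six-word sentence yields five practice questions, each with its answer already attached. Scale that up to everything ever written and you have the training set.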

Play one step of pre-training.

Guess the next token, then reveal what the model would predict. It isn't memorizing facts — it's learning a distribution over how the world tends to continue.

Example 1 of 2 · an easy one

nudge the knobs so the true token scores higher next time

Precomputed illustrative distributions. You just did one step. Now do it a trillion times, over everything humans have written — that's the whole training run.

Now the part that should genuinely surprise you. This game sounds like it should only teach the model to mimic surface text — to be a fancy parrot. But to get good at predicting the next token, the model is quietly forced to learn the machinery underneath the text:

To finish "The capital of France is ___," it has to absorb a fact about the world.
To finish "She poured the water into the glass until it was ___," it has to track physical cause and effect (full, not empty).
To finish "2 + 2 = ___," it picks up a sliver of arithmetic.
To finish a line of code, it has to internalize syntax and logic.
To finish the last line of a sonnet, it learns rhyme, meter, and the shape of an argument.

None of these were taught directly. There is no "facts" stage and no "physics" stage. There is only next-token prediction. All of the model's knowledge, grammar, and rough world-model are side effects of getting very, very good at one autocomplete game. That's the single most important idea in this section, and it's worth sitting with, because it explains both why these models are so eerily capable and why they fail in the specific weird ways they do (the strawberry-counting from Section A is one — the letters were never in the game).

Everything fell out of one task.

↺

predict the next token

the only objective

World facts

…is Paris

Cause & effect

…until it was full

Arithmetic

2 + 2 = …

Code syntax

…) { return …

Grammar

…agrees with the subject

Translation

chien → dog

Basic reasoning

therefore …

Story structure

…and the final couplet rhymed

Nobody programmed any of these in. They all condensed out of relentlessly practicing one prediction.

Why the field landed here. Before this, teaching a machine a skill meant collecting a labeled dataset for that skill — expensive, narrow, and capped by how much labeling you could afford. Next-token prediction broke the cap: the entire internet became free training data, and a single objective produced a general model. The bitter, repeatedly-relearned lesson of the last decade is that this dumb, scalable approach beats clever hand-built ones almost every time. That's why every frontier lab pre-trains the same basic way.

The nudge, opened up — how the model knows which way is “right”

We've now said “nudge the numbers toward what would have been right” four times without explaining the trick that makes it possible — and it's genuinely the most important trick in the field, so let's open it up. The puzzle is sharper than it looks: a model has billions of these internal numbers, so when its guess is wrong, how does it know which way to nudge each one? You can't know what each number does — and here's the part that surprises everyone: nobody does. No human has ever looked at weight number 4,000,000,000 and known what it “means.” The astonishing thing is that you don't have to. Here's how, in four moves.

1 · Turn “how wrong?” into a single number. After the model guesses, you compare its prediction to the true next token and boil the gap down to one number — the wrongness (the field calls it the loss). Guessed “Paris” confidently and the answer was “Paris”? Tiny wrongness. Confidently guessed “banana”? Huge wrongness. That single number is the model's whole report card for this one example: how badly did I just do?
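One common way to compute that single number, assumed here for illustration, is the negative logarithm of the probability the model gave to the true token (engineers call this cross-entropy). The probabilities below are made up.

```python
import math


def wrongness(predicted_probs, true_token):
    """Negative log of the probability assigned to the true token.

    Near 0 when the model was confidently right; large when it was
    confidently wrong.
    """
    return -math.log(predicted_probs[true_token])


# Made-up predictions for the blank in "The capital of France is ___"
confident_right = {"Paris": 0.97, "Lyon": 0.02, "banana": 0.01}
confident_wrong = {"Paris": 0.01, "Lyon": 0.02, "banana": 0.97}

low = wrongness(confident_right, "Paris")
high = wrongness(confident_wrong, "Paris")
print(low, high)
```

The first score comes out close to zero and the second is more than a hundred times larger. That gap is the whole report card.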

2 · For each knob, ask only: which way reduces the wrongness? Now the key move. For every one of the billions of knobs, you ask a narrow question: if I turned this knob up a hair, would the wrongness go up or down — and how sharply? That answer — a direction and a steepness, like the slope of the ground under each knob — is all you need. Notice what you didn't need: you never had to know what the knob means. It might secretly relate to French geography, or to nothing nameable at all — irrelevant. You only need its effect on the wrongness. That's the escape from the puzzle: a knob reveals which way to move not by what it represents, but by how it changes the error. (This direction-and-steepness, gathered for every knob, is called the gradient.)

↻

turn this way →

wrongness

↓ drops

You never learn what the knob means — only which way it should turn to be less wrong. That's enough.
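The "turn it up a hair and watch the wrongness" question can be asked literally. This is a minimal sketch with a two-knob toy: `toy_wrongness` is an invented stand-in whose lowest point is known in advance, so the answers can be checked. Real training does not probe knobs one at a time like this; it gets the same slopes far more cheaply.

```python
def slope(wrongness_fn, knobs, index, hair=1e-6):
    """Estimate how the wrongness changes if one knob is turned up by a hair."""
    nudged = list(knobs)
    nudged[index] += hair
    return (wrongness_fn(nudged) - wrongness_fn(knobs)) / hair


def toy_wrongness(knobs):
    # Invented stand-in: wrongness is lowest when the knobs equal (3, -1).
    return (knobs[0] - 3.0) ** 2 + (knobs[1] + 1.0) ** 2


knobs = [0.0, 0.0]
gradient = [slope(toy_wrongness, knobs, i) for i in range(len(knobs))]
print(gradient)
```

The first slope is negative, so turning that knob up makes things better. The second is positive, so that knob should go down. At no point did the code need to know what either knob "means."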

3 · How you get all billions of directions at once: work backward. The natural worry: computing that slope for billions of knobs sounds impossibly slow. The breakthrough that makes modern AI even possible is a way to get them all in one sweep, by working backward from the mistake. Picture a company that just shipped a defective product. Rather than interview all billion employees separately, you start at the end — the defect — and pass blame backward: the final team says “we caused this slice; the rest came from what was handed to us,” and hands the remainder to the team before them, who do the same, on back to the start. In one backward pass, every employee learns their exact share of the blame. The network does precisely this — the wrongness flows backward through its layers, each knob picking up its share with one cheap local calculation. That backward blame-assignment is backpropagation, the single algorithm without which none of this scales.
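Here is the blame-passing on the smallest network that can show it: two knobs in a chain. This is a sketch under stated assumptions: the network, the squared-error wrongness, and all the numbers are invented for illustration. The check at the end compares the backward pass against the slow turn-one-knob-at-a-time method.

```python
def forward(w1, w2, x):
    h = w1 * x   # first team's output
    y = w2 * h   # second (final) team's output
    return h, y


def wrongness(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2


def backward(w1, w2, x, target):
    """One backward sweep: start at the mistake and pass blame toward the input."""
    h, y = forward(w1, w2, x)
    blame_y = 2.0 * (y - target)  # how the wrongness changes with the output
    grad_w2 = blame_y * h         # the final team's share
    blame_h = blame_y * w2        # the remainder, handed back one team
    grad_w1 = blame_h * x         # the first team's share
    return grad_w1, grad_w2


w1, w2, x, target = 0.5, 2.0, 3.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Slow check: nudge each knob separately and measure the change.
hair = 1e-6
check_w1 = (wrongness(w1 + hair, w2, x, target) - wrongness(w1, w2, x, target)) / hair
check_w2 = (wrongness(w1, w2 + hair, x, target) - wrongness(w1, w2, x, target)) / hair
print(grad_w1, grad_w2, check_w1, check_w2)
```

Both methods agree, but the backward sweep got every knob's share from a single pass, while the slow check needed one extra run of the network per knob. With billions of knobs, that difference is the whole game.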

4 · Nudge every knob a hair downhill. Repeat. Now each knob has its marching order — which way, how hard — so you nudge all billions a tiny step in their wrongness-reducing direction. The model is now imperceptibly less wrong on that one example. Do it on the next chunk of text, and the next, trillions of times. This whole loop — measure the wrongness, find each knob's downhill direction, take a tiny step — is gradient descent, and the picture is exactly that: you're walking the wrongness downhill, one small step per example, into a low valley where the model's predictions are usually right.
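The full loop fits in a dozen lines. This sketch uses an invented two-knob wrongness whose lowest point is at (3, -1), so you can see the knobs arrive there. The step size and number of steps are arbitrary choices for the toy.

```python
def slopes(knobs):
    # Exact slopes for the toy wrongness (k0 - 3)^2 + (k1 + 1)^2.
    return [2.0 * (knobs[0] - 3.0), 2.0 * (knobs[1] + 1.0)]


def descend(knobs, step_size=0.1, steps=100):
    """Repeatedly nudge every knob a small step in its downhill direction."""
    for _ in range(steps):
        knobs = [k - step_size * s for k, s in zip(knobs, slopes(knobs))]
    return knobs


final = descend([0.0, 0.0])
print(final)
```

Each pass closes a fixed fraction of the remaining distance to the valley floor, so the knobs creep toward (3, -1) without ever being told where it is. They only ever follow the local slope.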

One honest wrinkle, and why it mostly doesn't bite. “Always step downhill” has a famous trap: you can get stuck in a small dip when a deeper valley sits just over the next hill, and a pure downhill-walker never climbs out, because climbing means getting temporarily worse. This worried the field for years. What rescued it is partly luck of scale, and the reason is worth feeling. Picture the trap in two dimensions (a valley on a map) and it's easy to get boxed in: there are only so many ways to turn. But a real model isn't moving in 2 dimensions; it's moving in billions (one per knob). To be truly stuck, every single one of those billions of directions would have to point uphill at once. With that many directions, there's almost always at least one that still tilts downhill, an escape hatch, so genuine traps are vanishingly rare. The very high-dimensionality that makes the model impossible to picture is what keeps it from getting stuck. And partly it's deliberate engineering: training is given a little momentum (so it rolls over small humps like a ball with speed) and a little randomness (so it gets jiggled out of shallow dips). The result isn't the perfect bottom; it's a reliably good one. Worth remembering whenever someone claims a training run found “the optimal” anything: it found a good valley, not the global floor.
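A toy calculation gives a feel for the dimension argument. It rests on a loud simplification that is an assumption of this sketch, not a fact about real models: pretend each direction is an independent coin flip between "tilts uphill" and "tilts downhill." Real training landscapes are not coin flips, so treat the numbers as intuition only.

```python
def chance_fully_trapped(num_directions, chance_uphill=0.5):
    """Toy estimate of the chance that every direction tilts uphill at once.

    ASSUMES each direction is an independent coin flip, which is a
    simplification made purely for illustration.
    """
    return chance_uphill ** num_directions


for d in (2, 10, 100):
    print(d, chance_fully_trapped(d))
```

With two directions, the toy says you are boxed in a quarter of the time. With a hundred, the chance is already far smaller than one in a billion billion, and a real model has vastly more than a hundred knobs.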

That's the entire engine. Strip the jargon and it's just: show an example, measure how wrong, let each knob feel which way is downhill, step, repeat — without anyone ever knowing what a single knob means. The same loop, run on the next-token game across the whole internet, is what did all that nudging in C1. And it's worth holding onto, because every later way of shaping a model — the fine-tuning that turns this raw model into an assistant (Section D), the reinforcement learning that teaches it to reason (Section E) — is this same loop, just pointed at a different measure of “wrong.” Learn it once and you understand how every model, at every stage, is trained.

So that's what “all that nudging” was actually doing. Run the engine to completion and you get the strange object we turn to next.

The base model — a brilliant genius with no manners

So pre-training finishes. You now have a base model: billions of parameters (the internal knobs, tuned by all that nudging) holding a compressed, tangled imprint of human writing. It knows a staggering amount. And it is not a chatbot.

This trips up almost everyone, so let's make it concrete. A base model is a pure text-completer. It does exactly one thing: continue whatever text you give it, in the most statistically plausible way, based on the internet it ate. It has no concept that it's supposed to help you. Ask a raw base model:

"What's a good recipe for banana bread?"

…and instead of a recipe, it might continue:

"What's a good way to keep bananas from going brown? What's the best pan to use? These are the questions every home baker asks…"

Why? Because on the open internet, a question is very often followed by more questions — in forum posts, FAQs, listicles. The base model isn't broken. It's doing its job perfectly: predicting the most plausible continuation. It just turns out that "answer the user's question helpfully" is not the most statistically common thing that follows a question on the raw web. The model is, as we put it in Section A, brilliant but feral.
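A deliberately tiny caricature shows the logic. The "web" below is invented for this sketch: a short list recording what kind of line followed what. A real base model predicts tokens rather than line types, but the principle is the same: it returns whatever most often came next in what it read.

```python
from collections import Counter

# Invented toy "web": the kind of line that appeared, in order.
toy_web = ["question", "question", "question", "answer",
           "question", "question", "answer", "statement"]


def most_plausible_next(sequence, current):
    """Return whatever most often followed `current` in the sequence."""
    followers = Counter(
        nxt for cur, nxt in zip(sequence, sequence[1:]) if cur == current
    )
    return followers.most_common(1)[0][0]


print(most_plausible_next(toy_web, "question"))
```

In this toy corpus a question is followed by another question more often than by an answer, so that is what the completer produces. Nothing is broken; the statistics simply point that way.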

Same brain, different behavior.

Base model · after pre-training

It completes, it doesn't answer

"What's a good recipe for banana bread?"

"What's a good way to keep bananas from going brown? What's the best pan to use? These are the questions every home baker asks…"

Knows everything, helps with nothing.

→

Section D
post-training

Helpful model · after post-training

Same knowledge, taught to be useful

"What's a good recipe for banana bread?"

"Here's a simple loaf: 3 ripe bananas, 1⅓ cups flour, ⅓ cup melted butter, ¾ cup sugar, 1 egg, 1 tsp baking soda. Mash, mix, bake at 350°F for ~55 min…"

Same brain — now it answers.

Pre-training builds the raw intelligence. It does NOT build the helpfulness. Those are two different jobs.

This is the cleanest way to hold the whole architecture of model-building in your head:

Pre-training builds the raw intelligence — the knowledge, the world-model, the latent skills. It's most of the cost and most of the capability.
Post-training builds the helpfulness — the instinct to answer, to follow instructions, to behave. It's comparatively cheap, and it adds almost no new knowledge.

That second job — the finishing school that turns this feral genius into the assistant you actually chat with — is the entirety of Section D. For now, just lock in the division of labor: pre-training makes it smart; post-training makes it helpful, and they are not the same thing. When you hear a lab talk about "the base model" versus "the chat model" or "the instruct model," this is the line they're drawing.

Why this matters for evaluation. A lot of a model's personality and safety lives in post-training, but almost all of its raw ceiling — what it could ever possibly know or reason about — is set here, in pre-training. If a startup is fine-tuning someone else's open base model, they're decorating an intelligence they didn't build, and its ceiling is mostly fixed. That's not necessarily bad — but you should know which job they're actually doing.

The model is what it eats — data quality and the mixture

If pre-training is the model learning the world by reading, then the single biggest lever on what it becomes is what you let it read. The slogan, only half a joke: the model is what it eats. Garbage in, garbage out — at the scale of the entire internet.

There are two knobs here, and they're different:

1. Quality. The raw internet is mostly junk — spam, broken HTML, duplicated boilerplate, SEO sludge, toxic comment threads. Feed that in unfiltered and you get a model that has absorbed the internet's worst habits. So labs spend enormous, unglamorous effort filtering: deduplicating, stripping low-quality pages, scoring text for usefulness, scrubbing the worst content. The headline result the field keeps re-confirming is blunt: better-filtered data produces a better model at the same size and compute. The FineWeb work, for instance, showed that careful curation of web text measurably lifted downstream performance — the data recipe, not just the model, was the win.¹⁹

Most of pre-training is deciding what to throw away.

Raw internet — ~everything

spam · dupes · broken pages · forums · gold nuggets, all mixed

↓ dedupe

Deduplicated

strip near-identical copies

↓ quality-score

Quality-scored · toxic removed

drop low-value & harmful text

↓ format

Curated training corpus

clean, formatted, ready to feed

Illustrative: often the large majority of crawled text — well over half, and in aggressive pipelines the vast majority — is discarded.

Curating the diet is a real, guarded competitive edge — most of the engineering is deciding what's worth feeding it.
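The funnel above can be caricatured in a few lines. This is a minimal sketch, not any lab's pipeline: it removes copies that are identical after tidying up case and spacing (real systems also catch near-identical copies), and it uses a word count as a crude stand-in for a learned quality score. The sample pages and the `min_words` threshold are invented.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def curate(pages, min_words=5):
    """Drop duplicate pages, then drop pages too short to be useful."""
    seen = set()
    kept = []
    for page in pages:
        key = normalize(page)
        if key in seen:
            continue  # same text as an earlier page
        seen.add(key)
        if len(key.split()) < min_words:
            continue  # crude stand-in for a learned quality score
        kept.append(page)
    return kept


pages = [
    "Bananas ripen faster when stored near apples.",
    "bananas ripen  faster when stored near apples.",
    "click here!!!",
    "A matrix is a grid of numbers arranged in rows and columns.",
]
corpus = curate(pages)
print(len(pages), "->", len(corpus))
```

Even in this four-page toy, half the input is thrown away. The engineering effort in real pipelines goes into making those two `continue` lines much smarter.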

2. Mixture. Beyond cleaning, labs deliberately blend their data — so much web text, so many books, so much code, so much math, so much from each language. And the blend shapes the abilities. The most striking example is a widely-reported observation across lab tech-reports (rather than a single settled paper): adding more code to the training mix appears to make a model better at reasoning — even on tasks that aren't coding — apparently because code is unusually rich in clean, explicit, step-by-step logical structure. The effect is real enough that labs act on it; the precise mechanism and magnitude are still debated, so hold it as a strong regularity, not a law. Want a model that's good at structured thinking? Feed it more of the most structured text humans produce. The mixture is a dial on cognition, not just on vocabulary.

Same size, different diet, different mind.

Config A · code & math pushed up

Web

Books

Code

Math / STEM

Multilingual

Conversations

→ stronger reasoning & coding

Config B · multilingual pushed up

Web

Books

Code

Math / STEM

Multilingual

Conversations

→ stronger translation, more even cross-language quality

Same architecture, same size — the mixture is one of the most closely-guarded recipes in the field. (Sliders illustrative, not measured.)
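In practice a mixture is just a set of weights that decides how a token budget is divided. The sketch below uses invented weights for the two configurations pictured above; they are illustrative, like the sliders, and not any lab's actual recipe.

```python
def tokens_per_source(mixture_weights, total_tokens):
    """Split a training-token budget across sources in proportion to the weights."""
    total_weight = sum(mixture_weights.values())
    return {
        name: total_tokens * weight / total_weight
        for name, weight in mixture_weights.items()
    }


# Invented weights, for illustration only.
config_a = {"web": 40, "books": 10, "code": 25, "math": 15,
            "multilingual": 5, "conversations": 5}
config_b = {"web": 40, "books": 10, "code": 10, "math": 5,
            "multilingual": 30, "conversations": 5}

budget = 1_000_000  # hypothetical token budget
diet_a = tokens_per_source(config_a, budget)
diet_b = tokens_per_source(config_b, budget)
print(diet_a["code"], diet_b["code"])
print(diet_a["multilingual"], diet_b["multilingual"])
```

The budget is identical in both cases. Every token given to code is a token not given to something else, which is why the blend is a genuine design decision rather than a free upgrade.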

Why the field moved here. Early on, "more data" meant "more web text," full stop. The field gradually learned that which data matters as much as how much — that you can buy capability not just with bigger crawls but with smarter curation and a deliberate mixture. This is also why "we have proprietary, high-quality data" is one of the more credible moats a startup can claim: when public high-quality text is finite (hold that thought — C4 and C6), a private corpus of clean, relevant text is genuinely valuable.

What it still can't do: data curation can't conjure knowledge that isn't in the data, and it can't fully remove the biases woven through human writing — it can only shift them. And every filtering choice is itself a judgment call about what's "good" text, which bakes the curators' assumptions into the model. The diet shapes the mind, including its blind spots.

Chinchilla — how to spend a fixed pile of money

Now the one place in this section where a number earns its keep. Everything else here you can hold as intuition; this one result reshaped how every major lab budgets a training run, so it's worth feeling precisely.

Start with the core tension. A pre-training run has a roughly fixed budget — a fixed amount of compute (GPU-hours, which is to say money and time). You get to spend that budget on two things, and they trade off against each other:

Model size — how many parameters (knobs) the model has. Bigger = more raw capacity to store patterns.
Training data — how many tokens you show it. More = more experience to learn from.

Spend more on one and, for a fixed budget, you must spend less on the other. So: bigger model trained on less data, or smaller model trained on more data?

For years the field's instinct — codified in an influential 2020 paper from Kaplan and colleagues²⁰ — leaned hard toward size. Make the model as big as you can; data was treated as the lesser priority. The result was an arms race of ever-bigger parameter counts (175 billion, 280 billion, 530 billion…), and headlines that measured a model by its size alone.

Then in 2022, DeepMind's Chinchilla paper²¹ ran the experiment properly — training over 400 models across a wide range of sizes and data amounts — and delivered the verdict that the giants had been doing it wrong. Their finding, in plain words:

Most of the big models were badly undertrained. They had too many parameters and had been shown too little data. For the same compute budget, you'd get a smarter model by making it smaller and feeding it far more text. Roughly: every time you double the model's size, you should also double its training data — keep them balanced.

They proved it the only way that counts: they built Chinchilla (70 billion parameters, trained on ~4× more data) and it beat Gopher (280B), GPT-3 (175B), and the 530B Megatron-Turing model — while being smaller and therefore cheaper to run. A model a quarter the size, trained right, won.
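
To feel the balance rule in numbers, here is a minimal sketch. It leans on two rules of thumb that are not from this primer: training compute is roughly 6 × parameters × tokens, and the balanced point sits near 20 training tokens per parameter. Both are common approximations associated with the Chinchilla result, not exact laws, and the function name `balanced_split` is made up for illustration.

```python
import math

# Assumption (not from this primer): training compute in FLOPs is roughly
# 6 * parameters * tokens, a widely used back-of-envelope estimate.
FLOPS_PER_PARAM_PER_TOKEN = 6.0


def balanced_split(compute_flops, tokens_per_param=20.0):
    """Split a fixed compute budget into (parameters, tokens).

    tokens_per_param=20 is the commonly quoted rule of thumb; treat it as
    an approximation, not a constant of nature.
    """
    # compute = 6 * N * D and D = ratio * N  =>  N = sqrt(compute / (6 * ratio))
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    budget = 5.88e23  # example budget: 6 * 70e9 params * 1.4e12 tokens
    n, d = balanced_split(budget)
    print(f"{budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
    # Quadrupling the budget doubles both sides of the split.
    n4, d4 = balanced_split(4 * budget)
    print(f"4x budget -> params x{n4 / n:.1f}, tokens x{d4 / d:.1f}")
```

Under those assumptions the example budget lands at about 70 billion parameters and 1.4 trillion tokens, and quadrupling the budget doubles both numbers. That is the "double the size, double the data" rule in arithmetic form.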

This is the marquee chart of the section. It deserves to be felt, not read:

Bigger isn't smarter. Balanced is smarter.

For any fixed budget there's a Goldilocks point. Too big is just as wrong as too small — and the whole industry had been overshooting to the right.

More budget slides the whole sweet spot down-and-right — you can afford a bigger model and more data. The optimum isn't a fixed size; it depends on what you're spending.

After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations, no hard axis numbers.

Why it changed everything. Chinchilla didn't just tweak a hyperparameter — it told every lab they'd been wasting compute by overshooting on size. Almost overnight, the bragging metric shifted: "how many parameters?" started to matter less than "how many tokens did you train on?" Modern models are, by Chinchilla's logic, often deliberately smaller than their predecessors but trained on vastly more data (many trillions of tokens) — which also makes them cheaper to run, a double win. When you internalize this, the Section A warning lands hard: a giant parameter count is a vanity metric. The right question is always "trained on how much, and on what?"

And now the limitation that sets up the rest of the section — the one Chinchilla can't answer. Its rule says "for more compute, use a bigger model and more data." But it quietly assumes you can always get more data. In the real world, high-quality text is finite. Chinchilla tells you the ideal recipe; it doesn't tell you what to do when you run out of the main ingredient. That ingredient shortage is the data wall — and it's the live debate of this era.

Continued pre-training — you don't always start over

Before we hit the wall, one practical and often-misunderstood move. A finished base model isn't necessarily done being pre-trained. You can take that completed model and keep running the same next-token game on new text — fresher data, or a specialized corpus — to extend or specialize it without paying for a full training run from scratch. This is continued pre-training (sometimes "continual" or "domain-adaptive" pre-training).

Two common uses make it concrete:

Freshening. A model's knowledge stops at its knowledge cutoff — the date its training data ends (more on this limitation next). Continued pre-training on more recent text can push that cutoff forward without rebuilding the model.
Specializing. Take a strong general base model and continue pre-training it on, say, a giant pile of medical literature, legal filings, or a company's internal codebase. The model keeps all its general intelligence but gets much deeper fluency in the specialized domain — far cheaper than training a domain model from zero.

You rarely start from zero.

Pre-training (from scratch)

months · tens of millions of $$$ · the whole internet

Base model v1

Continued pre-training

same next-token game, new/specialized data · days-to-weeks, a fraction of the cost

Specialized / freshened model

A finished base model is a launch point — keep feeding it the same game (C1) on new data to extend or specialize it.

Why this matters for evaluation. Continued pre-training is one of the main ways smaller players build something genuinely valuable on top of an open base model — a real domain specialist for a fraction of frontier cost. When a startup says "we trained our own medical model," this (plus Section D's fine-tuning) is often what they actually did, and it can be a legitimately strong product. The honest question is whether they added domain depth (credible) or just slapped a system prompt on someone else's model (much weaker).

What it still can't do: continued pre-training can add and freshen knowledge, but it can't fully overwrite what's baked in, and pushing too hard on a narrow corpus risks catastrophic forgetting — the model getting better at the specialty while quietly getting worse at general skills it used to have. It's a launch point, not a magic wand.

The limits — where pre-training runs out of road

Pre-training is the engine of the whole field, but it has hard edges. You need these to evaluate any "we'll just scale up" claim, because "just scale up" is exactly what's getting harder.

The data wall — high-quality text is finite. This is the big one, and it's a genuinely open debate, not settled doom. Chinchilla says "more compute → use more data." But there's only so much good human-written text in existence. The most-cited estimate, from Epoch AI's Villalobos and colleagues, puts the usable stock of high-quality public text at very roughly a few hundred trillion tokens, and projects that frontier training could exhaust the high-quality public supply sometime in the second half of the 2020s if current trends hold.²² Past that point, you can't follow Chinchilla's advice anymore — there isn't enough fresh, clean data to balance ever-bigger models. This is the data wall, and it's reshaping strategy across the field. (See the State-of-play box for where that debate actually stands in mid-2026.)

Appetite meets supply.

We're not out of data — we're running low on cheap, high-quality, PUBLIC text. The race is now about what to do when the easy ingredient runs short.

Escape hatch 1

Synthetic data

models generating their own training text

Escape hatch 2

New modalities

video, audio, images as fresh signal

Escape hatch 3

Smarter use of existing data

better curation, more passes

Illustrative shapes after Villalobos et al. (Epoch AI) 2022/2024. The shape is the lesson, not the numbers.

Baked-in biases and a frozen knowledge cutoff. Because all knowledge is a side effect of the training data (C1), the model inherits the data's biases — its skews, gaps, and prejudices — and its knowledge freezes at the knowledge cutoff (the date the data ends). A pre-trained model literally cannot know about events after its cutoff; it will confidently reason as if the world stopped on that date. (This is one big reason real systems bolt on tools like web search — to paper over a limitation that's structural to pre-training.)

Raw scale has diminishing returns. Early on, every order-of-magnitude more compute bought dramatic jumps. Increasingly, it buys less per dollar — the curve is bending. Bigger-and-bigger alone is no longer the obvious path to a smarter model, which is a large part of why the field's energy has shifted toward post-training and reasoning (Sections D and E) rather than just inflating pre-training. Scale still matters; it's just no longer a free lunch.

And the one we opened with: a base model is unsafe and unhelpful on its own. Pre-training, by itself, never produces a usable assistant — only a feral text-completer (C2). It will happily continue toxic text, make things up fluently, and ignore your actual request. Everything that makes a model safe, steerable, and helpful comes later. Pre-training builds the raw mind; it does not, and cannot, build the manners.

That hand-off — from a brilliant, dangerous, finished base model to the polite assistant you actually talk to — is precisely where Section D · Post-training picks up.

STATE OF PLAY — June 2026
· The "data wall" is real but contested. High-quality PUBLIC text is the scarce
  resource; total data (private, synthetic, multimodal) is not. The frontier
  debate is no longer "are we running out?" but "do the escape hatches work?"
· Escape hatch #1 — SYNTHETIC DATA — is now mainstream: frontier models train
  partly on text generated/curated by other models. Bull case: it extends the
  supply and lets you target weak spots. Bear case: "model collapse" — training
  on your own exhaust can quietly degrade quality if done carelessly.
· Escape hatch #2 — NEW MODALITIES — video/audio/image are increasingly treated
  as fresh pre-training signal as text tightens.
· Compute-rich players (the largest US labs, plus well-funded Chinese labs like
  DeepSeek/Qwen) can still out-scale; but the Chinchilla-era "just add data and
  parameters" reflex has clearly given way to data-quality, synthetic-data, and
  post-training/RL competition as the main levers.
· Continued pre-training + domain corpora is the standard way smaller players
  build credible specialists on top of open base models.
Specific labs, token-stock estimates, and exhaustion dates will age fast; the
mechanisms (next-token prediction, the size↔data tradeoff, the diet effect,
the finite-data tension) will not.

Section D

Post-training

Section C left us with a strange object. Pre-training spent months and most of the money to produce a base model — billions of tuned knobs holding a compressed imprint of the whole readable internet. It knows a staggering amount. And it is useless to talk to. Ask it for a banana-bread recipe and it might reply with more questions about banana bread, because on the open web a question is most often followed by more questions, not by a helpful answer. We called it what it is: a brilliant, feral library that talks like the average of the internet — knows everything, helps with nothing.

This section is the finishing school. Its whole job is to turn that feral genius into the polite, instruction-following assistant you actually chat with — without lobotomizing the intelligence Section C paid so much to build. And here is the thing to hold onto before we start: nothing in this section is a new kind of machine. It is C-core's same gradient-descent loop — show an example, measure how wrong, let each knob feel which way is downhill, take a tiny step — pointed at new data and a new measure of "wrong." Same brain. New manners. By the end you'll understand how taste gets turned into a number a model can chase — and that number is exactly the baton this section hands to Section E.

There is one piece of post-training big enough to need its own section: reinforcement learning, the part that teaches a model to reason. That's Section E. Everything else — the parts that make post-training matter but aren't the RL loop itself — is here.

Why post-training exists at all

Start with the cleanest division of labor in the whole field, because it's the one most people get wrong:

Pre-training builds the raw intelligence — the knowledge, the world-model, the latent skills. Most of the cost. Most of the capability.
Post-training builds the behavior — the instinct to answer rather than continue, to follow instructions, to refuse the genuinely harmful, to sound like an assistant and not like a scraped forum. Comparatively cheap. And — this is the surprising part — it adds almost no new knowledge.

That last clause is the load-bearing idea of the section, so let's make it concrete. The feral base model already knows the banana-bread recipe — the ingredients, the steps, the oven temperature are all in there, soaked up from ten thousand recipe blogs during pre-training. What it lacks is the instinct to hand them over when asked. It has the knowledge and not the manners. Post-training doesn't teach it to bake. It teaches it that when a human asks a question, the thing to do is answer it.

The slogan for the whole section: same brain, new manners. You are not making the model smarter. You are changing what it does with the smarts it already has.

This framing also tells you exactly what post-training can't fix, which matters enormously for judging any "we fine-tuned our own model" claim. If the knowledge isn't already in the base model, no amount of polite finishing-school behavior will conjure it — and as D2 will show, trying to stuff it in is actively dangerous. Post-training is a behavior dial, not a knowledge faucet. Hold that; it's the thread that runs through every subsection here and snaps shut in D6.

So the question this whole section answers is narrow and concrete: how do you civilize a feral text-completer — change its behavior, its instincts, its manners — using the exact same training loop that made it feral in the first place? There turn out to be three escalating answers, and they form the spine of D2 through E. We start with the bluntest one.

Supervised fine-tuning — teaching by example

The first and simplest move is the obvious one: show the model what good behavior looks like, and let it imitate. This is supervised fine-tuning — SFT — and at the mechanical level it is identical to pre-training. Same next-token game, same gradient descent, same backpropagation. The only thing that changes is the diet. Instead of feeding the model the raw internet, you feed it a curated pile of example conversations: a human writes a prompt, an expert writes the ideal response, and the model practices predicting that ideal response token by token.

That's it. The model that learned to autocomplete the internet now learns to autocomplete good assistant answers — and because it's such a powerful autocomplete, a few thousand well-chosen examples are enough to flip its default behavior from "continue the text" to "answer the question." The feral library starts acting like a librarian.

How a completion engine learns to take turns. There's a subtle problem hiding here, and solving it is most of what makes a chatbot a chatbot. A base model has no concept of "you" and "me" — it just continues text. So how does it learn the structure of a conversation: where your turn ends, where its turn begins, when to stop talking? The answer is a small, almost clerical trick called a chat template. The training examples are wrapped in invisible special tokens — little structural markers, never shown to you, that label each span of text: here begins the system instruction, here begins the user's turn, here begins the assistant's turn, here the assistant stops.

Inside the chat template

A chatbot is a text-completer that learned to fill in the "assistant" slot of a rigid template — and to stop when the slot is done.

What you see

User How do I make banana bread?

Assistant Start by mashing 3 ripe bananas, then mix in 1/3 cup melted butter…

invisible scaffolding revealed below

What the model sees

<|system|> You are a helpful assistant. <|end|>
<|user|> How do I make banana bread? <|end|>
<|assistant|> Start by mashing 3 ripe bananas… <|end|>

The model is trained to stop generating when it emits the final <|end|> — which is how it knows not to hallucinate the user's next question.

Illustrative; exact token names vary by model family (e.g. <|im_start|>, [INST], <|eot_id|>). The structural pattern is standard across SFT training.

Once the model has seen tens of thousands of conversations laid out in this rigid format, it absorbs the shape of dialogue the same way it absorbed grammar in pre-training — as a statistical regularity. It learns that text following the "assistant" marker should be helpful and on-topic, and that it should emit the "stop" token when its answer is complete rather than rambling on or, worse, hallucinating the user's next question (which, remember, is exactly what the feral base model wanted to do). The turn-taking you take for granted in any chatbot is this trick, learned by imitation.
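
For readers who like to see the clerical trick spelled out, here is a minimal sketch of a chat template as code. The marker names are the illustrative ones from the diagram above, not any real model's tokens, and `render_chat` and `cut_at_stop` are hypothetical helper names.

```python
# Illustrative marker names copied from the diagram; real model families
# use different special tokens.
END = "<|end|>"


def render_chat(messages, add_generation_prompt=False):
    """Flatten a list of {"role", "content"} dicts into one template string."""
    lines = [f"<|{m['role']}|> {m['content']} {END}" for m in messages]
    if add_generation_prompt:
        # At chat time the text stops right after the assistant marker,
        # so the model's job is to fill in that slot.
        lines.append("<|assistant|>")
    return "\n".join(lines)


def cut_at_stop(generated_text):
    """Keep only what the model wrote before its first stop marker."""
    return generated_text.split(END, 1)[0].strip()


if __name__ == "__main__":
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I make banana bread?"},
    ]
    print(render_chat(conversation, add_generation_prompt=True))
    print(cut_at_stop(" Start by mashing 3 ripe bananas. <|end|>\n<|user|> And then?"))
```

The second helper shows why the stop marker matters: anything the model emits after it, including an invented next user turn, is simply discarded.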

Data quality is the product. Now the counterintuitive part — the one that surprised the field. You might assume that more SFT data is always better. It isn't. The landmark demonstration is a 2023 study with a deliberately provocative name: LIMA — "Less Is More for Alignment." The researchers took a strong base model and fine-tuned it on just 1,000 carefully curated prompt-and-response examples — no reinforcement learning, no preference tuning, none of the machinery in the rest of this section — and the result was a startlingly capable assistant, competitive in head-to-head human comparisons with far more heavily trained models, and better than a model fine-tuned on roughly fifty times as many mediocre examples.²⁴

The lesson is blunt: for teaching behavior, a thousand excellent examples beat fifty thousand mediocre ones. And the reason is exactly the D1 thesis restated — the LIMA authors argued that almost all of a model's knowledge is already learned in pre-training, so fine-tuning isn't adding anything; it's just teaching the model the style and format in which to express what it already knows.²⁴ If the job is style, not substance, then a small set of impeccably styled examples does the job better than a giant pile of noisy ones. In post-training, the curation is the product. This is why labs guard their SFT datasets like recipes and why "we have great fine-tuning data" is a more credible claim than "we have lots of it."

The trap: SFT can teach the model to lie. Here's where the D1 thesis stops being a tidy slogan and turns into a sharp, practical warning, and it's the most important thing in this subsection. Naively, fine-tuning looks like a great way to add knowledge: just write training examples full of the facts you want the model to know, and it'll learn them, right? Wrong, and worse than merely ineffective: it's actively dangerous, and a 2024 study pinned down exactly why. The paper asks its question right in the title: "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" The answer is yes.²⁵

The mechanism is worth feeling. The researchers split fine-tuning examples into two kinds: facts the base model already knew (it just needed coaxing to say them) and facts that were genuinely new to it. Two things happened. First, the model learned the genuinely-new facts slowly and reluctantly — confirming that fine-tuning is bad at installing knowledge. But the second finding is the alarming one: as the model was eventually forced to fit those new facts, its tendency to hallucinate went up — and not just on the new facts, but more broadly.²⁵

Why? Because of what the model is actually learning from a training example that says "Question about an obscure fact → confident, correct answer." When the fact is something the base model already knew, that example teaches a fine, true lesson: retrieve what you know and state it. But when the fact is something the base model didn't know, the model can't learn the fact from one example — what it learns instead is the pattern: "when asked about something I don't know, produce a confident, authoritative-sounding answer anyway." You haven't taught it the fact. You've taught it to bluff. You've used the finishing school to install a habit of confident lying.

This is the dark mirror of D1's thesis. Behavior is trainable; knowledge mostly isn't. So when you try to train knowledge with the behavior-tools, what actually transfers is a behavior — and the behavior you accidentally install is fabrication. The practical takeaway, which we'll cash out in D6: if a startup says "we fine-tuned the model on our proprietary documents so it knows our domain," the honest question is whether those facts were already reachable in the base model (in which case fine-tuning helped surface them) or genuinely new (in which case they may have just trained a more confident hallucinator). Genuinely-new knowledge belongs in pre-training, in continued pre-training (Section C6), or bolted on at runtime via retrieval — not smuggled in through SFT.

So SFT gets you a long way: a model that takes turns, follows instructions, and imitates good answers. But imitation has a ceiling. You can only fine-tune on behavior you can write down as an example — and most of what makes one answer better than another isn't a fact you can demonstrate; it's a preference. That's the wall SFT hits, and climbing it requires a genuinely different idea.

Preference learning — turning taste into a number

Here's the wall. SFT teaches the model to imitate a single good answer. But for most real prompts there's no single right answer — there's a spectrum from worse to better, and the differences are matters of taste: this reply is more helpful, that one is too verbose, this one is subtly condescending, that one nails the tone. You can't write a demonstration of "be tasteful." Taste isn't a fact you can show; it's a judgment you can only render by comparing two options and saying which you prefer.

And it turns out humans are much better at comparing than at creating. Ask a person to write the ideal customer-service reply and you'll get something stilted. Ask them "which of these two replies is better?" and they'll answer instantly and reliably. So the entire trick of preference learning is to stop asking humans to write good answers and start asking them to judge — and then turn those judgments into something a machine can optimize.

Step one: collect comparisons. Show a human a prompt and two of the model's own answers. Ask only: which is better? Do this tens of thousands of times. You now have a pile of preference pairs — for each prompt, a "chosen" answer and a "rejected" answer. No one had to write the perfect response; they only had to point at the better of two.

Step two: train a judge — the reward model. Now the conceptual leap that makes everything downstream possible. You take all those (chosen, rejected) pairs and use them to train a second model — a reward model — whose only job is to look at any prompt-and-answer and output a single number: a score for how good that answer is. In other words, you distill diffuse human taste into a trainable function.

How does a pile of "A beat B" comparisons become a model that outputs scores? This is the one piece of real machinery in this subsection, and it's cleaner than it sounds. The reward model is, underneath, just a classifier — and the thing it's trained to classify is which answer wins. You show it the pair, it produces a score for each, and you train it with one rule: the chosen answer's score should come out higher than the rejected answer's score. Every time it gets the ordering wrong, gradient descent nudges its knobs so that next time the winner scores higher. That's the whole training loop — the same C-core engine, now pointed at "did I rank this pair correctly?"
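
The "one rule" can be written down in a few lines. This is a minimal sketch of the standard pairwise (Bradley-Terry style) loss on two scores, using plain Python numbers rather than a real reward model; the function name `pairwise_loss` is an illustrative choice.

```python
import math


def pairwise_loss(chosen_score, rejected_score):
    """-log(sigmoid(chosen - rejected)), computed without overflow.

    The loss is small when the chosen answer already scores well above
    the rejected one, and large when the ordering is wrong.
    """
    margin = chosen_score - rejected_score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    for chosen, rejected in [(3.0, 1.0), (1.0, 1.0), (1.0, 3.0)]:
        print(chosen, rejected, round(pairwise_loss(chosen, rejected), 4))
```

In a real training run the two scores come from the reward model itself, and gradient descent adjusts its knobs to shrink this loss across all the preference pairs.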

This was the heart of InstructGPT, the 2022 work that turned a feral GPT-3 into a model that follows instructions — and the direct ancestor of ChatGPT. Its reward model was trained exactly this way: on human pairwise comparisons, with a loss that pushed the preferred answer's score above the rejected one's.²⁶ The payoff is enormous and worth stating plainly: you now have a machine that can score any answer, including answers no human ever judged. Human taste, which was diffuse and slow and expensive, is now a fast, automatic number — and a number is something you can optimize against millions of times.

How taste becomes a number

Humans can't write "good taste," but they can point at it. The reward model turns thousands of those points into one number that scores anything.

Preference pairs

Pair 1 — same prompt, two answers

✓ Chosen — human picks this "Here's how to fix that bug — first check the null…"

✗ Rejected "That's an interesting question, perhaps you could try…"

Pair 2 — same prompt, two answers

✓ Chosen — human picks this "The capital of Australia is Canberra, not Sydney."

✗ Rejected "Sydney is often considered the capital of Australia…"

tens of thousands of pairs →

→

Reward model

Classifier trained on pairs

Input: any prompt + response
Output: one score

Training rule: chosen score > rejected score

Bradley-Terry loss

→

Score any answer

A known-good response (seen in training)

8.1

A mediocre, verbose response

4.3

New — never seen by a human A fresh model-generated response

7.2

Source: Ouyang et al. 2022 (InstructGPT), arXiv:2203.02155. Layout is illustrative; scores are not real outputs.

Step three — and here's where the section pivots. You have a number that scores any answer. The obvious next move is screaming at you: push the model to produce answers that score higher. Make the model chase the reward. But "tune the model to maximize a score, where the model's own outputs are what get scored" is no longer the simple imitation game of SFT. It's a fundamentally different kind of learning — learning from a score on your own attempts rather than from examples to copy — and it has subtle, important failure modes that need their own treatment. That is reinforcement learning, and it is Section E. Hold that thought for one more idea, because in 2023 the field found a shortcut that's worth understanding before you walk into the RL machinery.

DPO: skipping the loop entirely. The reward-model-then-RL pipeline has two heavy stages: train a separate reward model, then run a delicate RL optimization against it. In 2023 a paper called Direct Preference Optimization (DPO) asked a sharp question: do we actually need both? Its answer was a genuinely surprising piece of mathematics. The researchers showed that the entire "train a reward model, then do RL against it" objective can be rewritten as a single classification loss applied directly to the model you care about — no separate reward model, no RL loop at all.²⁷

The intuition, stripped of the algebra: there's a hidden mathematical identity between "a model's own answer-probabilities" and "an implied reward." Because of that identity, you can take your preference pairs and train the model directly with one loss that says, once more, "make the chosen answer more likely than the rejected one." It looks almost exactly like the reward-model training from step two, except the thing being trained is the final model itself, and when you're done, you're done. No reward model to build, no reinforcement-learning loop to stabilize.²⁷ DPO collapsed those two heavy stages into one, and a large fraction of open-source preference tuning runs on it precisely because it's so much simpler to get right.
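
Here is a minimal sketch of that single loss for one preference pair. Two details come from the DPO paper rather than from this primer: the implied reward is measured relative to a frozen reference copy of the model (typically the SFT model), and a strength parameter beta scales it. The value 0.1 below is just an assumed, commonly used setting, and the log-probabilities are plain numbers standing in for real model outputs.

```python
import math


def log_sigmoid(x):
    """log(sigmoid(x)) without overflow for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answer log-probabilities."""
    # Implied reward: how much more likely the model being tuned makes an
    # answer than the frozen reference does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Same shape as the reward-model rule: chosen should beat rejected.
    return -log_sigmoid(chosen_reward - rejected_reward)


if __name__ == "__main__":
    # Before any tuning the policy equals the reference, so the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Making the chosen answer more likely lowers the loss.
    print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

Notice that no score is ever produced by a separate judge: the "reward" is read straight off the model's own probabilities.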

So preference learning gives you two roads from the same fork. Both start from the same insight — taste becomes a number you can train against. DPO takes the shortcut and folds that number back into one supervised loss. The other road keeps the number as an explicit reward and optimizes against it with reinforcement learning — and that road is subtle and important enough to get its own section. You now have a number that scores any answer; pushing the model toward higher scores is a reinforcement-learning problem. Section E opens exactly there. Before we walk through that door, two more parts of post-training earn their place — one about scaling the judging, one about what the whole enterprise quietly costs.

Constitutional AI — when the model grades its own homework

Preference learning has an expensive bottleneck, and it's the humans. Every preference pair in D3 needs a person to read two answers and pick one. To align a frontier model you need hundreds of thousands of these judgments, across every topic and every flavor of bad behavior you want to stamp out. That's an army of human labelers, slow and costly — and for the genuinely nasty stuff (the toxic, dangerous, manipulative outputs you're trying to train out), it means asking people to read a lot of ugliness. The bottleneck isn't the model. It's the supply of human judgment.

The 2022 idea called Constitutional AI asked: what if the model could do most of the judging itself — not arbitrarily, but against a written set of principles? Hence the name: you give the model a short constitution, a plain-language list of rules ("prefer the response that is more helpful and honest," "choose the answer that is least likely to be harmful or to assist in dangerous acts," and so on). Then you let the model use that constitution to critique and improve its own outputs.²⁸

The mechanism is more concrete than "AI labels stuff," and the concreteness is the point. It runs in two phases:

First, self-revision. The model produces an answer. Then — prompted with one principle from the constitution at a time — it's asked to critique its own answer against that principle ("does this response do anything harmful? if so, how?") and then rewrite it to better satisfy the principle. The improved, self-revised answers become a new SFT dataset. The model has, in effect, edited its own homework against a rubric.²⁸
Second, self-judged preferences. Now back to D3's machinery — but with the human judge swapped out. The model generates two candidate answers, and another copy of the model is asked to pick which one better satisfies the constitution. Those AI-generated preferences train the reward model. This is the part the field calls RLAIF — reinforcement learning from AI feedback — the same pipeline as before, but the preference labels come from a model consulting principles instead of from a human.²⁸
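
The two phases above can be sketched as plain control flow. Everything here is a stand-in: `model` and `judge` are any functions that take text and return text, and the prompt wording is invented for illustration, not taken from any lab's actual constitution or pipeline.

```python
def self_revise(model, prompt, principles):
    """Phase one: critique and rewrite an answer once per principle."""
    answer = model(prompt)
    for principle in principles:
        critique = model(
            f"Principle: {principle}\nResponse: {answer}\n"
            "Critique the response against the principle."
        )
        answer = model(
            f"Principle: {principle}\nResponse: {answer}\nCritique: {critique}\n"
            "Rewrite the response to better satisfy the principle."
        )
    # The revised answer becomes one new SFT training example.
    return {"prompt": prompt, "response": answer}


def ai_preference(judge, prompt, answer_a, answer_b, constitution):
    """Phase two: a judge model picks the answer that better fits the rules."""
    verdict = judge(
        f"Constitution: {constitution}\nPrompt: {prompt}\n"
        f"(A) {answer_a}\n(B) {answer_b}\nWhich is better? Reply A or B."
    )
    a_wins = verdict.strip().upper().startswith("A")
    chosen, rejected = (answer_a, answer_b) if a_wins else (answer_b, answer_a)
    # Same (chosen, rejected) shape as a human-labeled preference pair.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The output of `ai_preference` has the same shape as a human-labeled pair from D3, which is why the rest of the preference pipeline can run unchanged.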

What this buys you is scale: the labeling step no longer needs an army of humans, so you can generate vastly more preference data and push much harder on hard-to-cover behavior. What it costs you is a real and honest risk: the model's judgments are only as good as the constitution and as good as the model's own ability to apply it. A vague or incomplete principle produces vague or incomplete alignment, and any blind spot in the judging model gets amplified through the whole pipeline — you're now training on the model's own opinions about its own outputs, errors and all. It scales the judgment; it does not guarantee the judgment is right. That tension — scalable feedback versus trustworthy feedback — is live and unresolved, and it's worth keeping in view whenever a lab describes "AI feedback" in its alignment stack.

The hidden costs — alignment tax and the distillation shortcut

Everything so far has been about what post-training gives you. Now the bill, in two parts, because both explain things you'll hear claimed about real models.

The alignment tax: why "the model got nerfed." Recall the through-line of this whole section: post-training reshapes behavior without adding knowledge. But there's a darker version of that fact. Reshaping behavior can quietly degrade capability. When you push a model hard toward being helpful, harmless, and on-style, it can get worse at some of the raw things it used to do well. The InstructGPT team measured exactly this: their aligned model, optimized on human preferences, regressed on a batch of standard benchmarks compared to the raw base model. It had become more helpful and more polite while becoming measurably worse at certain academic tasks. They described the effect as an alignment tax: the cost in raw capability you sometimes pay for better behavior.²⁶

This is a cousin of the catastrophic forgetting we'll meet again in Section E — a network learning a new objective can erode what it already knew.²⁹ Here the new objective is "behave like a good assistant," and the eroded thing is some slice of raw capability. It's the single best explanation for a complaint you'll hear constantly: "the new version got nerfed — it used to be able to do X and now it won't / can't." Sometimes that's nostalgia. But sometimes it's a real alignment tax: a round of behavior-tuning that made the model safer or more on-brand and, as a side effect, dulled an edge. (InstructGPT also showed the tax is partly payable down — mixing a little of the original pre-training objective back into the tuning recovered much of the lost benchmark performance — which is why this is a tax to manage, not an inevitability.)²⁶

Distillation — how the small open models are really made. The second hidden cost is really a hidden shortcut, and it's the secret behind a huge swath of the open-model ecosystem. The original idea, knowledge distillation, dates to 2015: train a small, cheap student model to imitate the outputs of a large, expensive teacher model. The student doesn't learn from raw data — it learns from the teacher's answers, which carry a richer signal than a plain label (the teacher's full sense of "this is 70% likely the right answer, 20% this other one" teaches the student nuance a hard label can't).³⁰
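To make that "richer signal" concrete, here is a minimal sketch. The three-answer setup, the numbers, and the function names are illustrative assumptions, not taken from any real model. It compares two hypothetical students that both pick the right answer but disagree about the runner-up. A hard label cannot tell them apart; the teacher's soft probabilities can. (The 2015 recipe also softens both distributions with a temperature, which is left out here for brevity.)

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_probs):
    """Distance of the student's probabilities from the target (lower is better)."""
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)

# Hypothetical question with three candidate answers: A, B, C.
teacher_soft = [0.70, 0.20, 0.10]   # the teacher's full opinion
hard_label = [1.0, 0.0, 0.0]        # a plain label: "A, end of story"

# Two hypothetical students: both rank A first, but disagree on the runner-up.
student_x = softmax([2.0, 0.8, 0.0])   # B second, like the teacher
student_y = softmax([2.0, 0.0, 0.8])   # C second, unlike the teacher

hard_x = cross_entropy(hard_label, student_x)
hard_y = cross_entropy(hard_label, student_y)
soft_x = cross_entropy(teacher_soft, student_x)
soft_y = cross_entropy(teacher_soft, student_y)

print("hard-label loss:", round(hard_x, 3), round(hard_y, 3))   # identical
print("soft-label loss:", round(soft_x, 3), round(soft_y, 3))   # x is lower
```

The hard label gives both students the same loss, so it has nothing to teach about B versus C. The soft target prefers the student whose second guess matches the teacher's, which is the nuance the paragraph above describes.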

In the post-training era this became the dominant way small open models get good fast. You take a small base model and run SFT (D2) on it — but the "expert answers" in the training set aren't written by humans. They're generated by a bigger, already-aligned model. The big model writes thousands of high-quality demonstrations; the small model imitates them. In one move the student inherits much of the teacher's behavior — its helpfulness, its format, its tone — at a tiny fraction of the cost of building that behavior from scratch with human labelers and reward models.

This is the answer to a question Section C raised and left hanging: how do small players ship surprisingly capable open models without frontier budgets? Often, this is how. They stand on a frontier model's shoulders — post-training a small base model on a large model's outputs — which connects directly to the open-source thread running through this primer. It's a legitimate and powerful technique. It also has a hard ceiling worth naming: distillation transfers behavior, and behavior is downstream of knowledge the student may not have. A distilled small model can sound exactly like its big teacher while lacking the depth to back it up under pressure — the very same "knowledge ceiling is set in pre-training" limit from Section C, wearing a borrowed voice.

The payoff, and the line you must not cross

Step back and look at what post-training actually accomplished. We took a feral text-completer — brilliant, knowledgeable, and useless to talk to — and put it through a finishing school in three escalating moves: imitate good answers (SFT, D2), learn human taste as a number (preference learning, D3), and scale that judgment with written principles (Constitutional AI, D4). The output is the assistant you actually use. And critically, we did all of it with C-core's same gradient-descent loop — measure how wrong, step downhill — just pointed at new data and new measures of "wrong." Same brain. New manners. The promise of D1, paid off.

But the finishing school has a hard limit, and it is the most important caution in this section because nearly every public confusion about AI lives right here:

Aligned is not the same as true. Post-training shapes behavior, not knowledge. It can make a model sound more confident, more helpful, more authoritative, more on-brand — without making a single one of its statements more correct. This is the D2 hallucination trap generalized into a law: the very tools that make a model seem trustworthy (smooth tone, decisive answers, a helpful eagerness to respond) are behavioral, and behavior is exactly what post-training is best at installing. A perfectly aligned model that confidently tells you something false is not malfunctioning. It is doing precisely what it was tuned to do — behave like a helpful expert — on a question where it lacks the knowledge to be one. Alignment polished the delivery; it did nothing for the truth of the content. Trust the manners; verify the facts.

And that lands us at the exact spot where this section hands off. Run through D3 once more and notice what we built but never used: a reward model — a number that scores any answer. D3 gave you the score and then deliberately stopped, because the obvious next move — push the model to produce answers that score higher — turns out not to be the simple imitation of SFT at all. It's a different kind of learning entirely: learning from the score on your own attempts, with no example to copy. That is reinforcement learning, and getting it right — pushing toward the reward without the model breaking, gaming the score, or quietly forgetting how to be a coherent model — is subtle and consequential enough to deserve its own section.

You now have a number that scores any answer. Pushing a model toward higher scores is a reinforcement-learning problem. That problem is Section E.

Section E

Reinforcement Learning, in plain English

Everything so far — pre-training, fine-tuning — was the model learning from examples that already existed. Someone wrote the text; the model copied the pattern. Reinforcement learning (RL) is fundamentally different, and the difference is the whole point: there are no examples to copy. The model has to learn from the consequences of its own actions.

That's a big shift, so let's not start with the model at all. Let's start with a dog.

The whole idea, in one analogy: training a dog

You want to teach a dog to sit. You can't explain it. You can't show it a textbook. All you can do is: wait, watch what the dog does, and reward the behavior you like. Dog flops down? No treat. Dog sits? Treat. Over many tries, the dog does more of what gets treats and less of what doesn't. It never gets told the rule — it discovers the rule by chasing the reward.

That is reinforcement learning, entire. Now here's the same picture with the five pieces of jargon labeled — because once you've seen them on the dog, they'll never scare you again:

Same five pieces, on a dog you already understand.

These five terms are the core vocabulary of RL. None of them is new: you learned them as a kid, teaching a dog to sit.

Policy (the plan): The dog's current strategy for getting treats.
Rollout (one try): One attempt to sit, start to finish.
Reward (for example, R 9): Treat or no treat. Just a number, not yet good or bad.
Advantage (▲ reinforce, ▼ suppress): Better or worse than its average try. This is the part that actually teaches.
Explore-Exploit (the choice): Try something new vs. repeat what already worked.

This panel is the key. The same five colors and words come back later as the actual training loop — only then, the "dog" is a language model.

The Rosetta Stone for reinforcement learning. Five jargon words, one familiar scene.

Let's take the five pieces one at a time. We'll keep the dog around, and bring in a video game when it helps.

Policy — the player's current strategy

The policy is just the model's current strategy for what to do next. For the dog, it's "given what I'm seeing and hearing, what should my body do?" For a language model, the policy is literally the model itself: given the conversation so far, what's its strategy for choosing the next token?

The entire goal of RL is to improve the policy — to make the strategy better over time. At the start it's bad (the dog flops, the model rambles). Each round of training nudges the strategy toward choices that earn more reward. When a lab says "we did RL on the model," they mean: we ran this loop to upgrade the model's strategy.

Think of it as a video game too: your policy is your current playing style — how good you are at the game right now. A beginner's policy mashes buttons; an expert's policy is refined. RL is the practice that turns one into the other.

Reward — the treat, and the trouble with treats

The reward is the signal that tells the model how good an outcome was. Treat for the dog. Points for the game. For an LLM, the reward might come from that reward model we met in post-training (it scores "how much would a human like this answer?"), or — in the powerful newer setups — from something far more objective: did the code pass the tests? Did the math problem reach the correct final answer?

That last point is quietly enormous, and it's worth flagging now because it explains a lot of the 2025–2026 frontier:

Why math and code are the RL goldmine. In most of life, "was that a good answer?" is fuzzy and needs a human to judge. But for math and code there's an automatic, hard-to-argue-with reward: the final answer is right or wrong, the tests pass or fail. That means you can run RL at massive scale with no humans in the loop, generating millions of attempts and rewarding the ones that work. This is a large part of why the models that suddenly got dramatically better at math and coding got there through RL: those domains hand you a near-perfect treat-dispenser almost for free.
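A verifiable reward can be surprisingly small. The sketch below is an illustration only: the function name and the "take the last integer in the response" rule are assumptions made for this example, and real graders are considerably stricter about format and edge cases.

```python
import re

def verifiable_reward(response: str, expected: int) -> float:
    """Return 1.0 if the last integer in the response equals the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+", response)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == expected else 0.0

# Hypothetical model attempts at the same prompt.
attempts = [
    "17 × 24 = 408",
    "(17×20)+(17×4) = 340 + 68 = 408",
    "17 × 24 = 388",
    "I think it's around four hundred.",
]
rewards = [verifiable_reward(a, expected=17 * 24) for a in attempts]
print(rewards)  # [1.0, 1.0, 0.0, 0.0]
```

No human looked at any of those answers. That is the whole appeal: the check costs almost nothing, so it can be run on millions of attempts.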

But rewards are also where RL gets dangerous, and this is the limitation you must understand to evaluate any "we used RL" claim. Whatever you reward, you get — including the loopholes. This is reward hacking (a flavor of Goodhart's Law: when a measure becomes a target, it stops being a good measure). The dog version: if you accidentally treat the dog every time it barks while sitting, you'll train a dog that sits and barks its head off, because barking became part of "what gets treats." The model version: if your reward model slightly prefers longer, more confident-sounding answers, RL will gleefully produce a model that's longer-winded and more confidently wrong — it found the loophole. RL optimizes exactly what you measure, not what you meant.

This is why frontier labs spend so much effort designing rewards that can't be gamed, and why "we did RL and the benchmark went up" should make you ask: did the model get smarter, or did it just learn to please your specific reward? (We'll return to that as a litmus test.)
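Here is a toy version of that loophole. Everything in it is hypothetical: the candidate answers, the word list, and the deliberately flawed scoring rule, which pays for length and confident wording and never checks whether the answer is right.

```python
# Hypothetical candidates: (answer text, is it actually correct?)
candidates = [
    ("408.", True),
    ("The answer is 408.", True),
    ("After careful and thorough analysis, I am absolutely certain "
     "the answer is definitely 388, without any doubt whatsoever.", False),
]

CONFIDENT_WORDS = ("certain", "definitely", "absolutely")

def flawed_reward(text: str) -> float:
    """A deliberately bad proxy: rewards length and confident wording, ignores correctness."""
    length_bonus = 0.1 * len(text.split())
    confidence_bonus = sum(1.0 for word in CONFIDENT_WORDS if word in text.lower())
    return length_bonus + confidence_bonus

best_text, best_is_correct = max(candidates, key=lambda c: flawed_reward(c[0]))
print(best_text)
print("correct?", best_is_correct)  # correct? False
```

The highest-scoring answer is the long, confident, wrong one. An RL loop pointed at this reward would push the model toward more answers like it.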

Rollout — one full attempt at the level

A rollout is one complete attempt, start to finish. The dog's single try at sitting. One full playthrough of a game level. For a language model, a rollout is the model generating a whole answer to a prompt — the entire response, start to "done."

Why give this its own word? Because RL learns by comparing many rollouts. You don't learn much from one attempt. You let the model take a hard problem and try it, say, a hundred different ways (this is where temperature and randomness earn their keep — they make the attempts vary). Some rollouts nail it; some flop. The reward sorts them. And from that spread of "this attempt good, that one bad," the model figures out what to do more of. Rollouts are the raw experience RL learns from — the model's own attempts are its only textbook.

One problem, many attempts.

RL doesn't need a textbook answer — it just needs to know which of its OWN attempts worked, then do more of those.

PROMPT: Solve: 17 × 24 = ?

One prompt → many attempts:

17 × 24 = 408 (R 9, ▲ reinforce)
(17×20)+(17×4) = 408 (R 9, ▲ reinforce)
408 (rounded check) (R 8, ▲ reinforce)
17 × 24 = 388 (R 2, ▼ suppress)
17 + 24 = 41 (R 1, ▼ suppress)

RL learns by comparing many rollouts — the spread is the lesson.

Illustrative rollouts for a math problem. Green checkmarks indicate correct answers, red X's indicate wrong answers.

Advantage — "was that better than my usual?"

Here's the subtle one, and it's the key that makes RL actually work. Suppose the dog sits and gets a treat. Good — but how good? If the dog sits every time and always gets a treat, then this particular sit was nothing special; it's just average. But if the dog usually flops and this time it sat — that sit was a big positive surprise, and that's the moment worth reinforcing hard.

Advantage is exactly this: how much better (or worse) was this attempt compared to what I'd normally expect? Not the raw reward — the surprise in the reward. A rollout that scored above the model's average gets pushed harder ("do more of this!"); one that scored below gets pushed away ("less of that"); one that's exactly average barely moves anything.

Why not just use the raw reward? Because raw scores are noisy and uninformative on their own. A "7 out of 10" means nothing until you know whether 7 is great (you usually get 3s) or disappointing (you usually get 9s). Advantage is the baseline-subtracted signal: it strips out "how hard is this problem in general" and isolates "did this attempt beat my own expectation." That's the clean learning signal. It's also why a beginner gamer improves so fast: their average is so low that any halfway-decent attempt lands far above it, so the advantage signal is strong and every small win teaches a lot.

Advantage = reward − baseline

Reward says "this scored 9." Advantage says "this beat your usual 6 — do more of it." The second one is what actually drives learning.
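The subtraction is worth doing once by hand. This sketch reuses the illustrative rewards from the 17 × 24 rollouts earlier in this section (9, 9, 8, 2, 1) and takes their plain average as the baseline. The function name is an assumption for this example.

```python
from statistics import mean

# Rewards copied from the illustrative 17 × 24 rollouts in this section.
rollouts = [
    ("17 × 24 = 408", 9),
    ("(17×20)+(17×4) = 408", 9),
    ("408 (rounded check)", 8),
    ("17 × 24 = 388", 2),
    ("17 + 24 = 41", 1),
]

def advantages(rewards, baseline=None):
    """Subtract a baseline (default: the average reward) from every reward."""
    if baseline is None:
        baseline = mean(rewards)
    return [r - baseline for r in rewards]

rewards = [reward for _, reward in rollouts]
baseline = mean(rewards)  # (9 + 9 + 8 + 2 + 1) / 5 = 5.8

for (attempt, reward), adv in zip(rollouts, advantages(rewards)):
    verdict = "reinforce" if adv > 0 else "suppress"
    print(f"{attempt:<24} reward={reward}  advantage={adv:+.1f}  {verdict}")

# Raise the baseline to 8.5 and the reward-8 attempt flips from positive to negative.
raised = advantages(rewards, baseline=8.5)
print(raised)
```

With the average at 5.8, the reward-8 attempt earns an advantage of +2.2. Against a stronger baseline of 8.5, the same attempt scores -0.5. Nothing about the attempt changed, only what it is being compared to.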

FIXED PROMPT & ROLLOUTS

Solve: What is the capital of France?

Model's current baseline (average score): 6.0

Drag the baseline to see how advantages flip. When your average goes up, the same rollout becomes less impressive.

Interactive advantage calculation. The relative comparison is what drives learning, not the absolute reward scores.

Under the hood, lightly. The famous RL algorithms you'll hear named, PPO (the workhorse from RLHF) and the leaner GRPO that powered DeepSeek's math breakthrough, are at heart careful machinery for computing this advantage and nudging the policy by it without lurching too far in one step.⁹ ¹⁰ That "don't lurch too far" guardrail matters: push the model too hard toward the reward and it can break, forgetting its language skills while chasing points (a failure labs informally call drift, related to catastrophic forgetting). Labs hold that back with a leash (a KL penalty) tying the model to its sensible starting point. You don't need the equations. You need the shape: try many times, see which tries beat your average, lean that way, but gently.

Exploration vs. exploitation — the gambler's dilemma

The last piece is the tension that sits underneath all of RL, and it's deeply human. Imagine your favorite restaurant. Every night you face a choice: order the dish you know is great (exploit what works), or try something new on the menu that might be even better — or might be a disappointment (explore). Order the usual forever and you'll never discover the better dish. Gamble every night and you'll eat a lot of bad meals. The art is the balance.

That's exploration vs. exploitation, and every RL system lives or dies by it:

Too much exploitation: the model locks onto the first decent strategy it finds and stops improving. The dog learns one mediocre trick and never discovers it could do better. In RL terms, the policy collapses — every rollout looks the same, there's no variety to learn from, progress flatlines.
Too much exploration: the model thrashes around trying wild things, never consolidating what works, never getting reliably good.
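One common dial for this balance is the sampling temperature mentioned earlier. The sketch below is an illustration under assumed numbers: three menu items with made-up preference scores, turned into choice probabilities at a low, medium, and high temperature.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Convert scores to probabilities; a higher temperature flattens the choice."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: the usual dish, then two dishes you have never tried.
menu_scores = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(menu_scores, 0.1)    # almost pure exploitation
warm = softmax_with_temperature(menu_scores, 1.0)    # a mix
hot = softmax_with_temperature(menu_scores, 10.0)    # almost pure exploration

for label, probs in (("cold", cold), ("warm", warm), ("hot", hot)):
    print(label, [round(p, 2) for p in probs])
```

At the cold setting the usual dish wins essentially every time, which is the collapse described in the first bullet. At the hot setting the three dishes are nearly a coin toss, which is the thrashing described in the second.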

Every step, a choice.

Learn nothing new, or risk everything — RL is the constant art of tuning this dial. Early on, explore boldly. As you get good, exploit what works.

The fundamental tension in reinforcement learning — between safety and discovery.

This is also where you can feel why RL on language is so much harder than RL in a game. In chess, every move is legal-or-not and the board tells you the truth. In language, the space of possible "moves" (sentences) is effectively infinite, the reward is often a fuzzy human judgment, and a model can explore its way straight into eloquent nonsense that fools the reward model. RL gave us the leap in reasoning models — but it's a leap walked on a knife's edge between "discovered something genuinely new" and "found a clever way to cheat the score."

Putting it together — and how to use it as a bullshit detector

Step back and you can now read RL as one clean loop, in five plain words: try, score, compare, lean, repeat. The model (policy) takes many full attempts (rollouts), each earns a reward, advantage measures which attempts beat the model's own average, the strategy leans toward those — gently, on a leash to prevent drift — while balancing exploration against exploitation. Run that loop at scale, with a reward you can trust, and you get the dramatic reasoning gains of the modern era.

Try, score, compare, lean, repeat.

Every modern reasoning model is this loop, run a staggering number of times.

POLICY tries

Current strategy generates one ROLLOUT — an attempt.

REWARD scores it

A number for the attempt — R 9

ADVANTAGE = reward − baseline

Above average ▲ reinforce, below ▼ suppress.

Update POLICY

Nudge the weights toward what beat the baseline. Repeat.

the same machine, again

Five words ran the dog; the same five words run the model — unchanged.

The complete reinforcement learning loop. Policy → Rollouts → Rewards → Advantage → Policy update → Repeat.
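The loop is short enough to run on a toy. The sketch below is not how a language model is trained; it is an assumed miniature with three possible strategies, where only the first ever earns a reward. It uses a plain policy-gradient update with the group average as the baseline, and it has no clip or KL leash. All names and settings are illustrative.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Pick an index at random according to probs."""
    r, cumulative = rng.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

def train(steps=300, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards = [1.0, 0.0, 0.0]   # hypothetical: only strategy 0 earns the treat
    logits = [0.0, 0.0, 0.0]    # the policy starts with no preference
    for _ in range(steps):                                          # repeat
        probs = softmax(logits)
        tries = [sample(probs, rng) for _ in range(group_size)]     # try
        scores = [rewards[a] for a in tries]                        # score
        baseline = sum(scores) / len(scores)
        for action, score in zip(tries, scores):
            advantage = score - baseline                            # compare
            for k in range(len(logits)):                            # lean
                grad = (1.0 if k == action else 0.0) - probs[k]
                logits[k] += lr * advantage * grad / group_size
    return softmax(logits)

final_probs = train()
print([round(p, 3) for p in final_probs])
```

The policy starts at one-third each and ends up strongly preferring the rewarded strategy, without ever being told which one it was. Note that a group in which every try picks the same strategy produces zero advantage everywhere, so nothing is learned from it: variety in the attempts is what makes the comparison possible.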

And here's the payoff — the reason a layperson should care about any of this. When someone tells you "our model got better because of reinforcement learning," you now own the questions that separate substance from spin:

"What was the reward — and could the model have hacked it?" A trustworthy reward (passing real tests, correct math) is worlds apart from a fuzzy one the model can game by sounding confident.
"Did it get smarter, or just better at your benchmark?" RL optimizes precisely what you measure. A benchmark jump can be real reasoning or a learned loophole. (This is exactly the reward-hacking trap from E3, now in the wild.)
"How did you keep it from drifting?" If they can't speak to keeping the model stable and general while pushing it toward the reward, they may have a model that's brittle outside their narrow test.
"Where did the variety in attempts come from?" No exploration, no genuine learning — just a model polishing what it already knew.

If they have crisp, technical answers, you're likely looking at real work. If they wave their hands and say "we did RL," you now know enough to keep your wallet closed.

PPO, from scratch — how to step toward the reward without falling off the cliff

We just said the whole loop in five words: try, score, compare, lean, repeat. But there's a word doing enormous quiet work in there, and it's lean. Once advantage has told you which attempts beat your average, you have to actually change the policy — nudge the strategy toward the good attempts. And it turns out how hard you nudge is the entire ballgame. Push gently and the model barely learns. Push too hard and the model breaks. PPO is the field's answer to "how hard do I push?" — and once you feel why that question is dangerous, the famous algorithm stops being an acronym and becomes obvious.

Start with the danger, because it's not intuitive. Go back to the restaurant. Suppose one night you try a new dish and it's spectacular — a huge positive advantage, way above your usual meal. The naive move is to overreact: "this is the best thing I've ever eaten, I will now order it every single night and never order anything else." You've just thrown away your whole balanced sense of what's good based on one great data point. Maybe that dish was great that night because the chef was in a mood. Maybe you'd hate it the third time. By lurching all the way over, you didn't just adopt a good idea — you destroyed the rest of your taste in the process.

A model does exactly this if you let it. A single batch of attempts says "longer answers scored higher" — and if you shove the policy hard in that direction in one update, the model can swing so far that it forgets how to write a short answer at all, or how to write coherently, while chasing that one signal. The update meant to make it slightly better makes it dramatically, brittlely worse. This is the cliff. The signal you're learning from is noisy and partial, but the update is permanent — so a big confident step on a small noisy signal is how you wreck a working model.

So the real engineering problem of RL on a language model isn't "find the good attempts." Advantage already did that. The problem is: take a step toward them that is big enough to learn from but small enough that you can't fall off the cliff. You want to nudge, never yank.

That is PPO — Proximal Policy Optimization, the workhorse algorithm behind the original RLHF that made ChatGPT-style models follow instructions.⁶ ⁹ The name is just the idea spelled out: proximal means "stay near where you started." PPO's one trick is a clip — a hard cap on how far a single update is allowed to move the policy. If an attempt had a big positive advantage, PPO says "yes, lean toward it… up to a limit, and not one inch past." Beyond that limit, extra eagerness earns you nothing — the update is clipped flat, so there's no incentive to lurch.⁹ It's the seatbelt that lets you press the accelerator.

The cooking analogy makes it concrete. You taste a sauce; it needs salt. The reward (advantage) says "saltier is better." A reckless cook dumps in the whole shaker — and now it's inedible, overshooting the very signal that was trying to help. A good cook adds a pinch, tastes again, adds another pinch. PPO is the pinch. It refuses to add more than a pinch per taste, no matter how strongly that one taste screamed "MORE SALT." The reward might be right about the direction and badly wrong about the dose — so you trust the direction and distrust the dose. That gap, between which way and how far, is the whole reason PPO exists.

Same direction, very different dose.

Both steps head the exact same way — toward the reward. The only difference is how far. PPO caps the step at the proximal ring and stays safe; the reckless one keeps going until it's out past where the model holds together. Lean, don't lurch.

PPO — same direction, capped dose. The step stops at the proximal ring (the clip). Close enough to home that one noisy signal can't wreck the model.

Reckless — same direction, huge dose. One giant step on a noisy signal carries far past the ring into the danger zone. The update meant to help breaks the model.

Schematic of PPO's clipped step. The "clip" caps how far one update may move the policy, no matter how strong the advantage.
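For readers who want to see the clip itself, here is a minimal sketch of the clipped objective for a single attempt, following the form in the PPO paper.⁹ The function name is an assumption, and the epsilon of 0.2 is a commonly used setting assumed here for illustration. The "ratio" measures how much more (or less) likely the new policy makes an attempt compared with the old one; a ratio of 1.0 means no change.

```python
def clipped_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO's clipped surrogate for one sample.

    ratio     = new policy's probability of the attempt / old policy's probability
    advantage = how much the attempt beat the baseline
    epsilon   = half-width of the allowed band around ratio == 1 (assumed 0.2)
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A good attempt (advantage +2): the payoff stops growing once the ratio passes 1.2.
for ratio in (1.0, 1.1, 1.2, 1.5, 3.0):
    print(ratio, round(clipped_objective(ratio, advantage=2.0), 2))
```

Past a ratio of 1.2 the objective is flat, so pushing the policy further earns nothing and the optimizer has no reason to do it. That flat region is the "not one inch past" in the text. For a bad attempt (negative advantage) the same formula stops rewarding further suppression once the ratio drops below 0.8.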

Now — why did a chunk of the field walk away from PPO, and what is this GRPO you keep hearing about? This is the part worth understanding, because it's the lever that let small open labs reach the frontier on a budget, and it shows up directly in how DeepSeek and others priced their breakthroughs.

PPO has a hidden cost. To know whether an attempt "beat the average," PPO trains a second whole model alongside the one you care about — a "value model" (also called a critic) whose only job is to predict the expected reward, so you have a baseline to subtract. Two models, twice the memory, twice the bookkeeping. Expensive. And on math and code — where, remember, you can just try the same problem many times — there's a cheaper baseline sitting in plain sight.

That cheaper baseline is the heart of GRPO (Group Relative Policy Optimization), the method introduced in DeepSeek's DeepSeekMath work.¹⁰ The insight is almost embarrassingly simple, and it's pure E5: you already have a spread of attempts at the same problem (your rollout fan from E4). So why train a separate model to guess the average? Just use the actual average of the group of sibling attempts as the baseline. Did this attempt beat the other attempts at this same problem? That's your advantage. No second model. No critic. You judge each try against its own siblings.¹⁰

The restaurant version: instead of hiring a food critic to tell you what a dish "should" score (PPO's value model), six friends each order, and you simply compare each plate to the table's average that night. Cheaper, no critic on payroll, and for problems where you can sample many attempts cheaply, just as informative.
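In code, the group baseline is a few lines. This sketch follows the group-relative form described in the DeepSeekMath paper, where each reward has the group mean subtracted and is then divided by the group's standard deviation.¹⁰ The function name, the small constant that avoids division by zero, and the choice of population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt against its own siblings: (reward - group mean) / group std."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]

# Six hypothetical attempts at one math problem: 1.0 = correct, 0.0 = wrong.
table = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print([round(a, 2) for a in group_relative_advantages(table)])

# If every sibling scores the same, nobody beat the table: all advantages are zero.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

No critic model appears anywhere. The price is that the method only works where you can afford several attempts per problem, and a group in which every attempt ties teaches nothing.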

Same question — "did this beat my average?"

PPO trains a second model to guess the average. GRPO just uses the average of the attempts it already has. That cost-cut is a big reason cheap, open math-RL scaled the way it did.

PPO (two models): A separate critic model predicts the baseline. The four rollouts are scored against that guessed line.

GRPO (one model, cheaper): No critic. The baseline is the average of the rollouts it already has; the dashed line is just their mean.

Legend for both panels: above the line = beat the baseline; below the line = worse.

Both panels score the same rollouts the same way — above the line earns advantage, below loses it. The only difference: PPO pays a second model to guess where the line goes; GRPO just draws it through the average it already had. Same safety, no second model.

Schematic comparison. Both clip the step (E8) and keep a KL leash (E9); they differ only in where the advantage baseline comes from.

So the field's arc here is clean: PPO gave us a safe way to step toward a reward without breaking the model — that safety is what made RLHF practical at all.⁶ ⁹ GRPO kept the same safety (it still clips the step, still nudges-not-yanks) but threw out the expensive critic, swapping it for the group baseline you already had lying around — which is exactly why verifiable-reward RL on math and code got cheap enough to run at enormous scale, and why, by 2025–2026, much of the reasoning-model wave was trained with recipes of this kind.¹⁰ What neither can do: invent a good reward, or save you if your reward is gameable. The step is now safe; the target still has to be honest. (Straight back to E3 — the machinery for leaning carefully says nothing about whether you're leaning toward the right thing.)

STATE OF PLAY — June 2026
· GRPO and its variants (the leaner, critic-free recipes) are now the default
  for large-scale RL on verifiable rewards; PPO remains common where a learned
  value model still earns its keep (e.g. some RLHF-from-human-preference work).
· "Did you use a critic or a group baseline?" is a fair, cheap signal of how a
  lab's RL costs actually scale. The specific recipe names will churn; the
  nudge-don't-yank principle underneath them will not.

The KL leash — keeping the model from forgetting how to be a model

PPO's clip stops you from taking one catastrophically big step. But there's a slower, sneakier failure that a step-size cap alone won't catch — and it's the one that should be on your diligence checklist. It's not about any single update being too large. It's about a thousand small, safe-looking updates that all quietly drag the model in the same direction, until it has wandered somewhere terrible.

Here's the failure in its purest form, and it's the dark twin of everything good about RL. Reward pulls the model toward whatever scores well. But "scores well on the reward" and "is still a sensible language model" are not the same thing — and over many rounds, the model can chase the first while losing the second. It learns to produce text that the reward loves and a human finds increasingly unhinged: stilted, repetitive, exploiting some tic of the scorer, eventually collapsing into fluent-looking gibberish that racks up points. It got better at the reward by getting worse at language. The model is, in a real sense, forgetting how to be a model in order to win the game. It's a cousin of what the field has long called catastrophic forgetting — a network learning a new task abruptly erasing what it knew before¹⁴ — but the RL version has its own name: drift. Same family (improvement on one front quietly costing you another), distinct mechanism (here it's reward-chasing pulling the model away from coherent language, not a second task overwriting a first).

This is the same monster from E3 — reward hacking — but seen as motion over time rather than a single loophole. The model isn't jumping off a cliff (PPO handles that). It's wandering away from home, one reasonable-looking step at a time, until it's lost.

The fix is exactly what you'd do with anything prone to wandering off: put it on a leash. Before RL starts, you have a perfectly sensible model — the one that came out of pre-training and fine-tuning, that writes fluent, coherent language. You tie the model-being-trained back to that original, sensible version with a tether, and the tether pulls back whenever the new model drifts too far from how the original would have spoken. The model is free to roam and improve at the task — but it can't wander off the property.

That leash has a name you'll hear in every serious RL conversation: the KL penalty. (KL is just a math measure of "how far apart are these two ways of speaking?" — you do not need the formula; you need the picture of a leash.) Every update now answers to two masters: get more reward, and don't drift too far from the sensible starting model.⁶ ¹⁰ The reward says "go!" The leash says "…but not past here." What survives is improvement that stays recognizably a good language model.
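You can skip the formula and still see the leash work on a toy. The sketch below is a simplification under stated assumptions: it compares two made-up probability distributions over three possible next words, whereas real systems measure the gap token by token across whole responses. The function names and the beta value (the leash tension) are illustrative.

```python
import math

def kl_divergence(policy_probs, reference_probs):
    """KL(policy || reference): 0 when the two match, larger as they diverge."""
    return sum(p * math.log(p / q)
               for p, q in zip(policy_probs, reference_probs) if p > 0)

def leashed_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Task reward minus a penalty for drifting from the reference model."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)

reference = [0.5, 0.3, 0.2]      # how the starting model would speak (hypothetical)
mild_drift = [0.6, 0.25, 0.15]   # a small shift toward the rewarded word
heavy_drift = [0.98, 0.01, 0.01] # nearly always the rewarded word

for name, policy in (("same", reference), ("mild", mild_drift), ("heavy", heavy_drift)):
    gap = kl_divergence(policy, reference)
    print(name, "KL:", round(gap, 3), "leashed reward:", round(leashed_reward(1.0, policy, reference), 3))
```

All three policies earn the same raw reward of 1.0 here, but the further a policy has drifted from the reference, the more of that reward the leash takes back. A larger beta makes the leash tighter; a beta of zero removes it.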

The reward pulls outward; the KL leash ties it to home.

Long enough to learn, short enough not to forget how to talk. Same machine as E8 — now guarding against drift over time instead of one big step.

Reward pulls outward. Every update tugs the model toward whatever scores well — even if “scores well” drifts away from coherent language.

The KL leash pulls home. It tethers the model to the sensible one it started as, so it can roam and improve — but not wander off the property.


Schematic of the KL penalty. "KL" is a measure of how far the trained model's way of speaking has drifted from its starting point.

And here is the craft, the thing that separates a team that knows what it's doing from one that doesn't, because the leash is no free lunch. Pull it too tight and the model is chained to the stake: it can't move, so it can't learn, and your expensive RL run barely improves anything. Leave it too loose and you're back to drift: the model slips the leash and wanders off into reward-hacked nonsense. The right tension is not a number you can look up; it depends on the task, the reward, how trustworthy that reward is, and how long you train. It is tuned, by people who've felt it go wrong both ways. The leash length is a dial, and knowing where to set it is exactly the kind of hard-won judgment you're paying a frontier team for.

Drift is a tradeoff you can feel, not a setting you can copy.

Too loose, it forgets how to talk. Too tight, it can't learn anything. There's no lookup value for the sweet spot — finding it is the craft.

[Chart: task reward and coherence plotted against leash tension (KL strength), from loose to tight, with a narrow sweet spot where the two are balanced.]

Illustrative tradeoff. Reward and coherence are precomputed functions of the KL strength; the sweet spot is deliberately narrow.

Which lands us back at the bullshit detector from E7, now with sharper teeth. When a team says "we did RL and the benchmark went up," the drift question — "how did you keep it from drifting?" — is no longer vague. You now know precisely what you're probing: Did you leash the model to its sensible self, and did you tune that leash well enough to get real gains without it quietly forgetting how to be a coherent model outside your test? A team with a crisp answer (a real baseline model, a tuned KL, evidence the model stayed general) is doing real work. A team that can't speak to it may have a model that dazzles on their one benchmark and falls apart the moment you take it off the leash — brittle exactly where it matters. The leash isn't a footnote of RL. It's the difference between a model that genuinely learned and one that just learned to game you without you noticing.

Notes & sources

The conceptual backbone above is evergreen. The boxed material below dates, and is fenced off deliberately.

STATE OF PLAY — June 2026
· No single "best" model: GPT-5-series, Claude Opus 4.6/4.7, Gemini 3.1 Pro,
  and DeepSeek (V3.2 / V4) each lead different slices — science reasoning,
  coding, agentic tasks, and price-performance respectively.
· RL on verifiable rewards (math, code) is the dominant frontier lever; open
  labs (DeepSeek, Qwen) reached the frontier largely via cheaper RL recipes
  (e.g. GRPO) rather than sheer scale.
· Reasoning models that "think" before answering (test-time compute) are now
  standard at the frontier, not a novelty.
Specific models/numbers will age fast; the mechanisms above will not.

Primary sources (canonical papers, verified via the Valency academic corpus)

Sennrich, Haddow & Birch, Neural Machine Translation of Rare Words with Subword Units (2015), arXiv:1508.07909 — the BPE subword-tokenization scheme.
Radford et al., Language Models are Unsupervised Multitask Learners (2019, the GPT-2 report) — byte-level BPE, which makes every input (including emoji and unseen symbols) representable.
Mikolov et al., Efficient Estimation of Word Representations in Vector Space (2013), arXiv:1301.3781 — static word embeddings (word2vec) and the king−man+woman≈queen geometry. (A static-embedding result; LLMs start from the same near-means-similar idea, but their token representations become contextual once attention mixes in the surrounding tokens; see Vaswani et al. (2017), the next entry.)
Vaswani et al., Attention Is All You Need (2017), arXiv:1706.03762 — the Transformer and self-attention.
Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556 — the "Chinchilla" compute-optimal scaling result.
Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (2022), arXiv:2203.02155 — InstructGPT / RLHF.
Rafailov et al., Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (2023), arXiv:2305.18290 — DPO.
Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022), arXiv:2212.08073 — Constitutional AI / RLAIF.
Schulman et al., Proximal Policy Optimization Algorithms (2017), arXiv:1707.06347 — PPO.
Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024), arXiv:2402.03300 — GRPO.

Sources for the boxed/dramatic claims (per the research-discipline rule on dramatic numbers)

Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (2023), arXiv:2305.15425 — some languages fragment into several times more tokens than English.
Epoch AI, Tracking frontier training compute & cost — estimate that frontier pre-training runs cost tens to hundreds of millions of dollars. (Estimate; figure moves over time.)

Schulman et al., Trust Region Policy Optimization (2015), arXiv:1502.05477 — TRPO; the explicit trust-region constraint that PPO later simplified into a clip (E8).
French, Catastrophic Forgetting in Connectionist Networks (1999), Trends in Cognitive Sciences 3(4):128–135 — the classic study of networks forgetting prior learning under new training. Cited in E9 as the cousin of RL “drift,” not an identity. (French names sequential-task interference; RL drift is reward over-optimization pulling a model away from coherent language. The KL-penalty fix is sourced to Ouyang et al. (2022) and Shao et al. (2024), both listed under Primary sources above.)
Hochreiter & Schmidhuber, Long Short-Term Memory (1997), Neural Computation 9(8):1735–1780, DOI:10.1162/neco.1997.9.8.1735 — the LSTM, the gated RNN that was the leading approach to sequence modeling before the Transformer (B1).
Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022), arXiv:2205.14135 — a hardware-aware reorganization of attention with identical mathematical results; the canonical example of architecture/hardware co-design (B5).
Gu & Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023), arXiv:2312.00752 — representative of the sub-quadratic “after the Transformer” research frontier referenced in B6 and the State of Play box.
Supporting context (B): Kaplan et al., Scaling Laws for Neural Language Models (2020), arXiv:2001.08361 — the earlier scaling-laws result underpinning “more compute → better models”; and Tay et al., Efficient Transformers: A Survey (2020), arXiv:2009.06732 — survey of efficient-attention approaches attacking the quadratic cost.
Penedo et al., The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (2024), arXiv:2406.17557 — large-scale web-data curation; careful filtering/dedup of pre-training data measurably improves downstream model quality (C3).
Kaplan et al., Scaling Laws for Neural Language Models (2020), arXiv:2001.08361 — the influential earlier scaling laws that leaned toward prioritizing model size; the view Chinchilla later corrected (contrast, C4).
Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556 — “Chinchilla.” For a fixed compute budget, model size and training tokens should scale together; most prior large models were undertrained. Trained 400+ models; the 70B Chinchilla beat the 280B Gopher, 175B GPT-3, and 530B Megatron-Turing NLG (C4).
Villalobos et al. (Epoch AI), Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data (2022, updated 2024), arXiv:2211.04325 — the canonical “data wall” analysis: a few hundred trillion tokens of high-quality public text, projected exhaustion in the second half of the 2020s (C4, C6).
Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget (2023), arXiv:2305.17493 — the “model collapse” concern in the June-2026 State-of-play box: careless training on model-generated data can degrade quality.
Zhou et al., LIMA: Less Is More for Alignment (2023), arXiv:2305.11206 (NeurIPS 2023). Fine-tuned a strong 65B base model on only 1,000 carefully curated prompt–response pairs with standard supervised loss — no RLHF, no preference modeling — and produced an assistant competitive in human evaluations with far more heavily trained models, and stronger than a model trained on ~50× more (noisier) data. The authors' "Superficial Alignment Hypothesis" — that a model's knowledge is overwhelmingly acquired in pre-training and alignment mainly teaches style and format for expressing it — is the source for D2's "data quality is the product" and the D1 behavior≠knowledge thesis.
Gekhman et al., Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (2024), arXiv:2405.05904. In a controlled closed-book QA setup, examples were split into facts the base model already knew vs. facts genuinely new to it. Findings: (1) the model learns genuinely-new facts slowly, confirming fine-tuning is poor at installing knowledge; (2) as it does fit the new-knowledge examples, its hallucination rate rises (roughly linearly) — because fine-tuning on unknown facts teaches the behavior of answering confidently regardless of whether the model actually knows. Direct source for the D2 "SFT can teach confident lying" trap.
Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (2022), arXiv:2203.02155 — InstructGPT, the direct ancestor of ChatGPT. Source for: the reward model trained on human pairwise comparisons with a Bradley–Terry/logistic loss (D3); and the alignment tax — the measured regression on standard NLP benchmarks (e.g. SQuAD, DROP, HellaSwag, WMT) after preference-tuning, which the paper shows is largely recoverable by mixing pre-training updates back in ("PPO-ptx") (D5).
Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023), arXiv:2305.18290. Shows the KL-constrained RLHF objective (reward model + RL) can be reparameterized into a single binary-classification loss applied directly to the policy — eliminating the separate reward model and the RL loop. Training reduces to a logistic loss on preference pairs that raises the log-probability of the chosen response over the rejected one. Source for D3's "DPO collapses the RL loop into one classification loss."
Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022), arXiv:2212.08073 (Anthropic). Two-phase method: (1) a supervised self-critique-and-revise stage where the model critiques its own outputs against written constitutional principles and rewrites them; (2) an RLAIF stage where a model — consulting the constitution — generates the preference labels that train the reward model, replacing the human labelers of standard RLHF. Source for D4's mechanism and the "scalable but only as good as the constitution + judging model" caveat.
French, Catastrophic Forgetting in Connectionist Networks (1999), Trends in Cognitive Sciences 3(4):128–135 — the classic account of networks eroding prior learning when trained on a new objective. Cited in D5 as the cousin of the alignment tax (behavior-tuning degrading raw capability), not an identity; the operational source for the alignment-tax claim itself is Ouyang et al. (2022), the InstructGPT paper listed above. This is the same reference reused in Section E for RL "drift."
Hinton, Vinyals & Dean, Distilling the Knowledge in a Neural Network (2015), arXiv:1503.02531 (NIPS 2014 Deep Learning Workshop). The canonical knowledge-distillation paper: a small student network is trained to match a large teacher's softened output distribution, inheriting the teacher's behavior far more cheaply than learning from scratch. Source for D5's account of distillation as the dominant way small open models are post-trained — SFT on a larger, already-aligned model's generated outputs.

Supporting: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020, arXiv:2005.14165); Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022, arXiv:2201.11903).