MicroGPT Explained Interactively

Original link: https://growingswe.com/blog/microgpt

## MicroGPT: An LLM in 200 Lines of Code

Andrej Karpathy created a fully working GPT language model in 200 lines of Python, demonstrating the core principles behind models like ChatGPT *without* relying on external libraries. The model learns to generate plausible human names from a dataset of 32,000 examples.

The process starts by converting names into numeric tokens: each character is assigned an id, plus one special "beginning of sequence" token. The model then predicts the next token in the sequence, learning the statistical relationships between characters. This prediction relies on the "attention" mechanism, which lets the model weigh the importance of different parts of its input.

Crucially, the model learns through backpropagation, adjusting its parameters to minimize the prediction error (the cross-entropy loss). This involves computing gradients and updating the parameters with an optimizer such as Adam.

Although this micro-GPT uses simple Python scalars, its underlying algorithm is identical to that of far larger LLMs; the rest is a matter of scale. The differences lie in using GPUs, larger datasets, more sophisticated tokenization, and vastly bigger models (more parameters and layers). Ultimately, the core loop stays the same: predict the next token, measure the error, and refine the model.

A Hacker News commenter (politelemon) notes that "kamon", "karai", "anna", and "anton" all appear in the training dataset (https://raw.githubusercontent.com/karpathy/makemore/988aa59/...), so other example names might have been better choices.

Original article

Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries, no dependencies, just raw Python. The script contains the complete algorithm that powers LLMs like ChatGPT. Everything else is just efficiency.

Let's walk through it piece by piece and watch each part work. Andrej did a walkthrough on his blog, but here I take a more visual approach, tailored for beginners.

The dataset

The model trains on 32,000 human names, one per line: emma, olivia, ava, isabella, sophia... Each name is a document. The model's job: learn the statistical patterns in these names and generate plausible new ones that sound like they could be real.

By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset. The model has learned which characters tend to follow which, which sounds are common at the start vs. the end, and how long a typical name runs. From ChatGPT's perspective, your conversation is just a funny-looking document. When you type a prompt, the model's response is a statistical document completion.

Numbers, not letters

Neural networks work with numbers, not characters. So we need a way to convert text into a sequence of integers and back. The simplest possible tokenizer assigns one integer to each unique character in the dataset. The 26 lowercase letters get ids 0 through 25, and we add one special token called BOS (Beginning of Sequence) with id 26 that marks where a name starts and ends.

Type a name below and watch it get tokenized. Each character maps to its integer id, and BOS tokens wrap both ends:

Tokenizing "emma": BOS → 26, e → 4, m → 12, m → 12, a → 0, BOS → 26

The integer values themselves have no meaning. Token 4 isn't "more" than token 2. Each token is just a distinct symbol, like assigning a different color to each letter. Production tokenizers like tiktoken (used by GPT-4) work on chunks of characters for efficiency, giving a vocabulary of ~100,000 tokens, but the principle is identical.
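Such a character tokenizer fits in a few lines. This sketch follows microgpt's naming (`uchars`, `BOS`) but uses a three-name stand-in for the full dataset:

```python
# Build a character-level vocabulary from the training names.
names = ["emma", "olivia", "ava"]      # stand-in for the 32,000-name dataset
uchars = sorted(set("".join(names)))   # unique characters get ids 0..n-1
BOS = len(uchars)                      # one extra id for the BOS token
vocab_size = len(uchars) + 1

def encode(name):
    # Wrap the name in BOS tokens, marking where it starts and ends.
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(tokens):
    return "".join(uchars[t] for t in tokens if t != BOS)

print(encode("emma"))  # → [7, 1, 4, 4, 0, 7] for this tiny vocabulary
```

With the full dataset, `uchars` holds all 26 lowercase letters and BOS gets id 26, matching the widget above.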

The prediction game

Here's the core task: given the tokens we've seen so far, predict what comes next. We slide through the sequence one position at a time. At position 0, the model sees only BOS and must predict the first letter. At position 1, it sees BOS and the first letter and must predict the second letter. And so on.

Step through the sequence below and watch the context grow while the target shifts forward:

[interactive stepper: sequence BOS e m m a BOS, with the current context marked as input and the next token marked as target]

Each step produces one training example: the context on the left is the input, the green token on the right is what the model should predict. For the name "emma", that's five input-target pairs. This sliding window is how all language models train, including ChatGPT.
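The sliding window is easy to sketch directly; the token ids below assume the a=0…z=25, BOS=26 scheme described earlier:

```python
# Build (context, target) training pairs for one name.
BOS = 26
tokens = [BOS, 4, 12, 12, 0, BOS]  # BOS e m m a BOS

pairs = []
for pos in range(len(tokens) - 1):
    context = tokens[: pos + 1]    # everything seen so far
    target = tokens[pos + 1]       # the next token to predict
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```

For "emma" this yields exactly the five input-target pairs described above, ending with the model learning to emit BOS after the final "a".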

From scores to probabilities

At each position, the model outputs 27 raw numbers, one per possible next token. These numbers (called Logits) can be anything: positive, negative, large, small. We need to convert them into probabilities that are positive and sum to 1. Softmax does this by exponentiating each score and dividing by the total.

Adjust the logits below and watch the probability distribution change. Notice how one large logit dominates: the exponential amplifies differences.

Example distribution: a 22.1%, b 8.1%, c 4.9%, d 1.8%, e 60.0%, other 3.0%

Here's the actual softmax code from microgpt. Step through it to see the intermediate values at each line:

```python
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp()
            for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Example logit values: a 1.200, e 2.800, m 0.500, n 1.800, BOS -0.300

The subtraction of the max value before exponentiating doesn't change the result mathematically (dividing numerator and denominator by the same constant cancels out) but prevents overflow. Without it, exp(100) would produce infinity.
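A quick check with plain floats (no Value objects) confirms both claims; this standalone sketch mirrors the softmax above:

```python
import math

def softmax_naive(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    # Subtracting the max cancels out mathematically but caps exp()'s input at 0.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 2.8, 0.5, 1.8, -0.3]
print(softmax_naive(logits))    # same probabilities either way
print(softmax_stable(logits))
# softmax_stable([1000, 999]) still works; math.exp(1000) alone overflows
```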

Measuring surprise

How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (about 0.1). If it assigns probability 0.01, the loss is high (about 4.6). The formula is -log(p), where p is the probability assigned to the correct token.
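The numbers above fall straight out of the formula:

```python
import math

# Cross-entropy loss for one position: -log(probability of the correct token)
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:5.2f}  ->  loss = {-math.log(p):.2f}")
# p = 0.9 gives ~0.11; p = 0.01 gives ~4.61
```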

Drag the slider to adjust the probability of the correct token and watch the loss change:

[interactive chart: loss -log(p) on the y-axis (0 to 5) against the probability p of the correct token on the x-axis (0 to 1)]

The curve has two properties that make it useful. First, it's zero when the model is perfectly confident in the right answer (p = 1 gives -log(1) = 0). Second, it grows without bound as p approaches 0, so confidently wrong predictions are punished severely.

Tracking every calculation

To improve, the model needs to answer: "for each of my 4,192 parameters, if I nudge it up by a tiny amount, does the loss go up or down, and by how much?" Backpropagation computes this by walking the computation backward, applying the chain rule at each step.

Every mathematical operation (add, multiply, exp, log) is a node in a graph. Each node remembers its inputs and knows its local derivative. The backward pass starts at the loss (where the Gradient is trivially 1.0) and multiplies local derivatives along every path back to the inputs.

Step through the forward pass, then the backward pass, for a small example where L = a · b + a:

[interactive computation graph: inputs a = 2.0 and b feed a multiply node and an add node that produce L]

Now step through the actual Value class code. Watch how each operation records its children and local gradients, then how backward() walks the graph in reverse, accumulating gradients:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data,
                     (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     (self, other), (other.data, self.data))

    def backward(self):
        # topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad
```

Notice that a has a gradient of 4.0, not 3.0. That's because a is used in two places: once in the multiplication (∂(a·b)/∂a = b = 3) and once in the addition (∂a/∂a = 1). Gradients from multiple paths accumulate, so a's gradient is 3 + 1 = 4.

This is the same algorithm that PyTorch's loss.backward() runs, operating on scalars instead of tensors. Same algorithm, just smaller and slower.
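To check the worked example numerically, here's a self-contained rerun of L = a·b + a using a condensed copy of the Value class above:

```python
class Value:
    # Condensed copy of the class above: just enough for + and *.
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
L = a * b + a          # forward: L = 2*3 + 2 = 8
L.backward()
print(a.grad, b.grad)  # a.grad = 3 + 1 = 4.0; b.grad = a = 2.0
```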

From IDs to meaning

We know how to measure error and how to trace that error back to every parameter. Now let's build the model itself, starting with how it represents tokens.

A raw token id like 4 is just an index. The model can't do math with a bare integer. So each token looks up a learned vector (a list of 16 numbers) from an Embedding table. Think of it as each token having a 16-dimensional "personality" that the model can adjust during training.

Position matters too. The letter "a" at position 0 plays a different role than "a" at position 4. So there's a second embedding table indexed by position. The token embedding and position embedding are added together to form the input to the rest of the network.
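A sketch of the lookup-and-add step, with a deliberately tiny embedding width (microgpt uses 16 dimensions; the table names `wte` and `wpe` match the gpt() code shown later):

```python
import random

random.seed(0)
n_embd = 4        # 16 in microgpt; 4 here to keep the printout short
vocab_size = 27
block_size = 8

# Both tables start as small random numbers and get tuned during training.
wte = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(block_size)]

token_id, pos_id = 4, 0    # the letter "e" at position 0
tok_emb = wte[token_id]    # the token's learned "personality"
pos_emb = wpe[pos_id]      # the position's learned vector
x = [t + p for t, p in zip(tok_emb, pos_emb)]  # input to the rest of the network
print(x)
```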

Click a token below to see its embedding vectors and how they combine:

token emb: [-0.08, 0.04, -0.01, 0.06, -0.03, 0.07, -0.05, 0.02]
+ pos emb: [-0.01, 0.06, -0.03, 0.02, -0.05, 0.04, -0.07, 0.01]
= combined: [-0.09, 0.10, -0.04, 0.08, -0.08, 0.11, -0.12, 0.03]
(showing 8 of 16 dimensions)

The embedding values start as small random numbers and get tuned during training. After training, tokens that behave similarly (like vowels) tend to end up with similar embedding vectors. The model learns these representations from scratch, with no prior knowledge of what a vowel is.

How tokens talk to each other

This is how transformers work. At each position, the model needs to gather information from previous positions. It does this through Attention: each token produces three vectors from its embedding.

A Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I offer if selected?"). The query at the current position is compared against all keys from previous positions via dot product. High dot product means high relevance. Softmax converts these scores into attention weights, and the weighted sum of values is the output.

Explore the attention weights below. Each cell shows how much one position attends to another. Switch between the four attention heads to see different patterns:

[interactive heatmap: attention weights for "BOS e m m a BOS", query position vs. key position, with a selector for the four attention heads]

The gray region in the upper-right is the causal mask. Position 2 can't attend to position 4 because position 4 hasn't happened yet. This is what makes the model Autoregressive: each position only sees the past.

Different heads learn different patterns. One head might attend strongly to the most recent token. Another might focus on the BOS token (to remember "we're generating a name"). A third might look for vowels. The four heads run in parallel, each operating on a 4-dimensional slice of the 16-dimensional embedding, and their outputs are concatenated and projected back to 16 dimensions.

The full picture

The model pipes each token through: embed, normalize, attend, add residual, normalize, MLP, add residual, project to output logits. The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply ReLU (zero out negatives), project back to 16. If attention is how tokens communicate, the MLP is where each position thinks independently.

Step through the pipeline for one token and watch data flow through each stage:

Token → (Token Embed + Pos Embed) → RMSNorm → Attn → Add → RMSNorm → MLP → Add → Output Logits
(vector dimension: 16 at every stage except the output, which is 27)

Here's the actual gpt() function from microgpt. Step through to see the code executing line by line, with the intermediate vector at each stage:

```python
def gpt(token_id, pos_id, keys, values):
    # Embed the token and its position, then add the two vectors
    tok_emb = state_dict["wte"][token_id]
    pos_emb = state_dict["wpe"][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)

    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, attn_wq)
        k = linear(x, attn_wk)
        v = linear(x, attn_wv)
        keys[li].append(k)    # cache this position's key...
        values[li].append(v)  # ...and value for future positions
        # multi-head attention; q_h, k_h, v_h are per-head slices
        # and "." denotes a dot product (simplified display)
        for h in range(n_head):
            attn_logits = [q_h . k_h[t] / sqrt(d)
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = weighted_sum(attn_weights, v_h)
        x = linear(x_attn, attn_wo)  # x_attn: concatenated head outputs
        x = [a + b for a, b in zip(x, x_residual)]

        x_residual = x
        x = rmsnorm(x)
        x = linear(x, mlp_fc1)
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)
        x = [a + b for a, b in zip(x, x_residual)]

    logits = linear(x, lm_head)
    return logits
```

The residual connections (the "Add" steps) are load-bearing. Without them, gradients would shrink to near-zero by the time they reach the early layers, and training would stall. The residual connection gives gradients a shortcut, which is why deep networks can train at all.

RMSNorm (root-mean-square normalization) rescales each vector to have unit root-mean-square. This prevents activations from growing or shrinking as they pass through the network, which stabilizes training. GPT-2 used LayerNorm; RMSNorm is simpler and works just as well.
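A plain-float sketch of RMSNorm (microgpt's version runs on Value objects, but the arithmetic is the same; the small `eps` guards against division by zero):

```python
import math

def rmsnorm(x, eps=1e-5):
    # Rescale the vector so its root-mean-square is (approximately) 1.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [xi * scale for xi in x]

x = [3.0, -4.0, 0.0, 5.0]
y = rmsnorm(x)
rms = math.sqrt(sum(yi * yi for yi in y) / len(y))
print(y, rms)  # rms is ~1.0 regardless of the input's scale
```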

Learning

The training loop repeats 1,000 times: pick a name, tokenize it, run the model forward over every position, compute the cross-entropy loss at each position, average the losses, backpropagate to get gradients for every parameter, and update the parameters to make the loss a bit lower.

The optimizer is Adam, which is smarter than naive gradient descent. It maintains a running average of each parameter's recent gradients (momentum) and a running average of the squared gradients (adaptive learning rate). Parameters that have been getting consistent gradients take larger steps. Parameters that have been oscillating take smaller ones.

Watch the loss decrease over 1,000 training steps. The model starts at ~3.3, the loss for random guessing among 27 tokens (-log(1/27) ≈ 3.3), and falls from there.

[interactive chart: loss falling from ~3.5 toward ~2.5 over the training steps, starting at the "random guessing" line; before training, the generated names are gibberish like "xqbzjf", "mwplkt", "gvrcnx"]

Step through the code for one complete training iteration. Watch it pick a name, run the forward pass at each position, compute the loss, run backward, and update the parameters:

```python
# Pick a document and tokenize it
doc = docs[step % len(docs)]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

# Forward pass: predict each next token
keys, values = [[] for _ in range(n_layer)], [...]
losses = []
for pos_id in range(n):
    token_id = tokens[pos_id]
    target_id = tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()
    losses.append(loss_t)
loss = (1/n) * sum(losses)

# Backward pass
loss.backward()

# Adam optimizer update
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad**2
    m_hat = m[i] / (1 - beta1 ** (step+1))
    v_hat = v[i] / (1 - beta2 ** (step+1))
    p.data -= lr * m_hat / (v_hat**0.5 + eps)
    p.grad = 0
```

Making things up

Once training is done, Inference is straightforward. Start with BOS, run the forward pass, get 27 probabilities, randomly sample one token, feed it back in, and repeat until the model outputs BOS again (meaning "I'm done") or we hit the maximum length.

Temperature controls how we sample. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the learned distribution. Lower temperatures sharpen the distribution (the model picks its top choices more often). Higher temperatures flatten it (more diverse but potentially less coherent output).
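The scaling can be sketched in isolation with plain floats (the logit values here are made up; microgpt divides its Value logits the same way):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
for temperature in (0.5, 1.0, 2.0):
    # Dividing logits by the temperature before softmax reshapes the distribution.
    probs = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 3) for p in probs])
# Low temperature sharpens the distribution; high temperature flattens it.
```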

Adjust the temperature and watch the probability distribution change:

[interactive chart: the model's original output distribution over a-z, with the top character around 28%]

Step through the inference loop to see a name being generated character by character. At each step, the model runs forward, produces probabilities, and samples the next token:

```python
temperature = 0.5
keys, values = [[] for _ in range(n_layer)], [...]
token_id = BOS
sample = []

for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(
        range(vocab_size),
        weights=[p.data for p in probs])[0]
    if token_id == BOS:
        break
    sample.append(uchars[token_id])

print("".join(sample))
```

A temperature approaching 0 would always pick the highest-probability token (greedy decoding). This produces the most "average" output. A temperature of 1.0 matches what the model actually learned. Values above 1.0 inject extra randomness, which can produce creative outputs but also nonsense. The sweet spot for names is around 0.5.

Everything else is efficiency

This 200-line script contains the complete algorithm. Between this and ChatGPT, nothing changes conceptually. The differences are all engineering: trillions of tokens instead of 32,000 names. Subword tokenization (100K vocabulary) instead of characters. Tensors on GPUs instead of scalar Value objects in Python. Hundreds of billions of parameters instead of 4,192. Hundreds of layers instead of one. Training across thousands of GPUs for months.

But the core loop is the same: tokenize, embed, attend, compute, predict the next token, measure surprise, walk the gradients backward, nudge the parameters. Repeat.
