MicroGPT Explained Interactively

Original link: https://growingswe.com/blog/microgpt

## MicroGPT: An LLM in 200 Lines of Code

Andrej Karpathy created a fully working GPT language model in 200 lines of Python, demonstrating the core principles behind models like ChatGPT *without* relying on external libraries. The model learns to generate plausible human names from a dataset of 32,000 examples.

The process starts by converting names into numeric tokens: each character is assigned an id, plus one special "beginning of sequence" token. The model then predicts the next token in the sequence, learning the statistical relationships between characters. This prediction relies on the "attention" mechanism, which lets the model weigh the importance of different parts of its input.

Crucially, the model learns through backpropagation, adjusting its parameters to minimize the prediction error (the cross-entropy loss). This involves computing gradients and updating the parameters with an optimizer such as Adam.

Although this micro-GPT uses simple Python scalars, its underlying algorithm is identical to that of far larger LLMs; the rest is a matter of scale. The differences lie in using GPUs, larger datasets, more sophisticated tokenization, and vastly bigger models (more parameters and layers). Ultimately, the core loop stays the same: predict the next token, measure the error, and refine the model.

A Hacker News commenter (politelemon) notes that "kamon", "karai", "anna", and "anton" all appear in the training dataset (https://raw.githubusercontent.com/karpathy/makemore/988aa59/...), so other example names might have been better choices.

Original article

Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries, no dependencies, just raw Python. The script contains the complete algorithm that powers LLMs like ChatGPT. Everything else is just efficiency.

Let's walk through it piece by piece and watch each part work. Andrej did a walkthrough on his blog, but here I take a more visual approach, tailored for beginners.

The dataset

The model trains on 32,000 human names, one per line: emma, olivia, ava, isabella, sophia... Each name is a document. The model's job: learn the statistical patterns in these names and generate plausible new ones that sound like they could be real.

By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset. The model has learned which characters tend to follow which, which sounds are common at the start vs. the end, and how long a typical name runs. From ChatGPT's perspective, your conversation is just a funny-looking document. When you type a prompt, the model's response is a statistical document completion.

Numbers, not letters

Neural networks work with numbers, not characters. So we need a way to convert text into a sequence of integers and back. The simplest possible tokenizer assigns one integer to each unique character in the dataset. The 26 lowercase letters get ids 0 through 25, and we add one special token called BOS (Beginning of Sequence) with id 26 that marks where a name starts and ends.

Type a name below and watch it get tokenized. Each character maps to its integer id, and BOS tokens wrap both ends:

Tokenizing "emma": BOS → 26, e → 4, m → 12, m → 12, a → 0, BOS → 26

The integer values themselves have no meaning. Token 4 isn't "more" than token 2. Each token is just a distinct symbol, like assigning a different color to each letter. Production tokenizers like tiktoken (used by GPT-4) work on chunks of characters for efficiency, giving a vocabulary of ~100,000 tokens, but the principle is identical.
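Such a character tokenizer fits in a few lines. This sketch follows microgpt's naming (`uchars`, `BOS`) but uses a three-name stand-in for the full dataset:

```python
# Build a character-level vocabulary from the training names.
names = ["emma", "olivia", "ava"]      # stand-in for the 32,000-name dataset
uchars = sorted(set("".join(names)))   # unique characters get ids 0..n-1
BOS = len(uchars)                      # one extra id for the BOS token
vocab_size = len(uchars) + 1

def encode(name):
    # Wrap the name in BOS tokens, marking where it starts and ends.
    return [BOS] + [uchars.index(ch) for ch in name] + [BOS]

def decode(tokens):
    return "".join(uchars[t] for t in tokens if t != BOS)

print(encode("emma"))  # → [7, 1, 4, 4, 0, 7] for this tiny vocabulary
```

With the full dataset, `uchars` holds all 26 lowercase letters and BOS gets id 26, matching the widget above.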

The prediction game

Here's the core task: given the tokens we've seen so far, predict what comes next. We slide through the sequence one position at a time. At position 0, the model sees only BOS and must predict the first letter. At position 1, it sees BOS and the first letter and must predict the second letter. And so on.

Step through the sequence below and watch the context grow while the target shifts forward:

[interactive stepper: sequence BOS e m m a BOS, with the current context marked as input and the next token marked as target]

Each step produces one training example: the context on the left is the input, the green token on the right is what the model should predict. For the name "emma", that's five input-target pairs. This sliding window is how all language models train, including ChatGPT.
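The sliding window is easy to sketch directly; the token ids below assume the a=0…z=25, BOS=26 scheme described earlier:

```python
# Build (context, target) training pairs for one name.
BOS = 26
tokens = [BOS, 4, 12, 12, 0, BOS]  # BOS e m m a BOS

pairs = []
for pos in range(len(tokens) - 1):
    context = tokens[: pos + 1]    # everything seen so far
    target = tokens[pos + 1]       # the next token to predict
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```

For "emma" this yields exactly the five input-target pairs described above, ending with the model learning to emit BOS after the final "a".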

From scores to probabilities

At each position, the model outputs 27 raw numbers, one per possible next token. These numbers (called Logits) can be anything: positive, negative, large, small. We need to convert them into probabilities that are positive and sum to 1. Softmax does this by exponentiating each score and dividing by the total.

Adjust the logits below and watch the probability distribution change. Notice how one large logit dominates: the exponential amplifies differences.

Example distribution: a 22.1%, b 8.1%, c 4.9%, d 1.8%, e 60.0%, other 3.0%

Here's the actual softmax code from microgpt. Step through it to see the intermediate values at each line:

```python
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp()
            for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Example logit values: a 1.200, e 2.800, m 0.500, n 1.800, BOS -0.300

The subtraction of the max value before exponentiating doesn't change the result mathematically (dividing numerator and denominator by the same constant cancels out) but prevents overflow. Without it, exp(100) would produce infinity.
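A quick check with plain floats (no Value objects) confirms both claims; this standalone sketch mirrors the softmax above:

```python
import math

def softmax_naive(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    # Subtracting the max cancels out mathematically but caps exp()'s input at 0.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 2.8, 0.5, 1.8, -0.3]
print(softmax_naive(logits))    # same probabilities either way
print(softmax_stable(logits))
# softmax_stable([1000, 999]) still works; math.exp(1000) alone overflows
```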

Measuring surprise

How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (about 0.1). If it assigns probability 0.01, the loss is high (about 4.6). The formula is -log(p), where p is the probability assigned to the correct token.
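The numbers above fall straight out of the formula:

```python
import math

# Cross-entropy loss for one position: -log(probability of the correct token)
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:5.2f}  ->  loss = {-math.log(p):.2f}")
# p = 0.9 gives ~0.11; p = 0.01 gives ~4.61
```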

Drag the slider to adjust the probability of the correct token and watch the loss change:

[interactive chart: loss -log(p) on the y-axis (0 to 5) against the probability p of the correct token on the x-axis (0 to 1)]

The curve has two properties that make it useful. First, it's zero when the model is perfectly confident in the right answer (p = 1 gives -log(1) = 0). Second, it grows without bound as p approaches 0, so confidently wrong predictions are punished severely.

Tracking every calculation

To improve, the model needs to answer: "for each of my 4,192 parameters, if I nudge it up by a tiny amount, does the loss go up or down, and by how much?" Backpropagation computes this by walking the computation backward, applying the chain rule at each step.

Every mathematical operation (add, multiply, exp, log) is a node in a graph. Each node remembers its inputs and knows its local derivative. The backward pass starts at the loss (where the Gradient is trivially 1.0) and multiplies local derivatives along every path back to the inputs.

Step through the forward pass, then the backward pass, for a small example where L = a · b + a:

[interactive computation graph: inputs a = 2.0 and b feed a multiply node and an add node that produce L]

Now step through the actual Value class code. Watch how each operation records its children and local gradients, then how backward() walks the graph in reverse, accumulating gradients:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data,
                     (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     (self, other), (other.data, self.data))

    def backward(self):
        # topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad
```

Notice that a has a gradient of 4.0, not 3.0. That's because a is used in two places: once in the multiplication (∂(a·b)/∂a = b = 3) and once in the addition (∂a/∂a = 1). Gradients from multiple paths accumulate, so a's gradient is 3 + 1 = 4.

This is the same algorithm that PyTorch's loss.backward() runs, operating on scalars instead of tensors. Same algorithm, just smaller and slower.
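To check the worked example numerically, here's a self-contained rerun of L = a·b + a using a condensed copy of the Value class above:

```python
class Value:
    # Condensed copy of the class above: just enough for + and *.
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

a, b = Value(2.0), Value(3.0)
L = a * b + a          # forward: L = 2*3 + 2 = 8
L.backward()
print(a.grad, b.grad)  # a.grad = 3 + 1 = 4.0; b.grad = a = 2.0
```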

From IDs to meaning

We know how to measure error and how to trace that error back to every parameter. Now let's build the model itself, starting with how it represents tokens.

A raw token id like 4 is just an index. The model can't do math with a bare integer. So each token looks up a learned vector (a list of 16 numbers) from an Embedding table. Think of it as each token having a 16-dimensional "personality" that the model can adjust during training.

Position matters too. The letter "a" at position 0 plays a different role than "a" at position 4. So there's a second embedding table indexed by position. The token embedding and position embedding are added together to form the input to the rest of the network.
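A sketch of the lookup-and-add step, with a deliberately tiny embedding width (microgpt uses 16 dimensions; the table names `wte` and `wpe` match the gpt() code shown later):

```python
import random

random.seed(0)
n_embd = 4        # 16 in microgpt; 4 here to keep the printout short
vocab_size = 27
block_size = 8

# Both tables start as small random numbers and get tuned during training.
wte = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.uniform(-0.1, 0.1) for _ in range(n_embd)] for _ in range(block_size)]

token_id, pos_id = 4, 0    # the letter "e" at position 0
tok_emb = wte[token_id]    # the token's learned "personality"
pos_emb = wpe[pos_id]      # the position's learned vector
x = [t + p for t, p in zip(tok_emb, pos_emb)]  # input to the rest of the network
print(x)
```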

Click a token below to see its embedding vectors and how they combine:

token emb: [-0.08, 0.04, -0.01, 0.06, -0.03, 0.07, -0.05, 0.02]
+ pos emb: [-0.01, 0.06, -0.03, 0.02, -0.05, 0.04, -0.07, 0.01]
= combined: [-0.09, 0.10, -0.04, 0.08, -0.08, 0.11, -0.12, 0.03]
(showing 8 of 16 dimensions)

The embedding values start as small random numbers and get tuned during training. After training, tokens that behave similarly (like vowels) tend to end up with similar embedding vectors. The model learns these representations from scratch, with no prior knowledge of what a vowel is.

How tokens talk to each other

This is how transformers work. At each position, the model needs to gather information from previous positions. It does this through Attention: each token produces three vectors from its embedding.

A Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I offer if selected?"). The query at the current position is compared against all keys from previous positions via dot product. High dot product means high relevance. Softmax converts these scores into attention weights, and the weighted sum of values is the output.

Explore the attention weights below. Each cell shows how much one position attends to another. Switch between the four attention heads to see different patterns:

[interactive heatmap: attention weights for "BOS e m m a BOS", query position vs. key position, with a selector for the four attention heads]

The gray region in the upper-right is the causal mask. Position 2 can't attend to position 4 because position 4 hasn't happened yet. This is what makes the model Autoregressive: each position only sees the past.

Different heads learn different patterns. One head might attend strongly to the most recent token. Another might focus on the BOS token (to remember "we're generating a name"). A third might look for vowels. The four heads run in parallel, each operating on a 4-dimensional slice of the 16-dimensional embedding, and their outputs are concatenated and projected back to 16 dimensions.

The full picture

The model pipes each token through: embed, normalize, attend, add residual, normalize, MLP, add residual, project to output logits. The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply ReLU (zero out negatives), project back to 16. If attention is how tokens communicate, the MLP is where each position thinks independently.

Step through the pipeline for one token and watch data flow through each stage:

Token → (Token Embed + Pos Embed) → RMSNorm → Attn → Add → RMSNorm → MLP → Add → Output Logits
(vector dimension: 16 at every stage except the output, which is 27)

Here's the actual gpt() function from microgpt. Step through to see the code executing line by line, with the intermediate vector at each stage:

```python
def gpt(token_id, pos_id, keys, values):
    # Embed the token and its position, then add the two vectors
    tok_emb = state_dict["wte"][token_id]
    pos_emb = state_dict["wpe"][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    x = rmsnorm(x)

    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, attn_wq)
        k = linear(x, attn_wk)
        v = linear(x, attn_wv)
        keys[li].append(k)    # cache this position's key...
        values[li].append(v)  # ...and value for future positions
        # multi-head attention; q_h, k_h, v_h are per-head slices
        # and "." denotes a dot product (simplified display)
        for h in range(n_head):
            attn_logits = [q_h . k_h[t] / sqrt(d)
                           for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = weighted_sum(attn_weights, v_h)
        x = linear(x_attn, attn_wo)  # x_attn: concatenated head outputs
        x = [a + b for a, b in zip(x, x_residual)]

        x_residual = x
        x = rmsnorm(x)
        x = linear(x, mlp_fc1)
        x = [xi.relu() for xi in x]
        x = linear(x, mlp_fc2)
        x = [a + b for a, b in zip(x, x_residual)]

    logits = linear(x, lm_head)
    return logits
```

The residual connections (the "Add" steps) are load-bearing. Without them, gradients would shrink to near-zero by the time they reach the early layers, and training would stall. The residual connection gives gradients a shortcut, which is why deep networks can train at all.

RMSNorm (root-mean-square normalization) rescales each vector to have unit root-mean-square. This prevents activations from growing or shrinking as they pass through the network, which stabilizes training. GPT-2 used LayerNorm; RMSNorm is simpler and works just as well.
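A plain-float sketch of RMSNorm (microgpt's version runs on Value objects, but the arithmetic is the same; the small `eps` guards against division by zero):

```python
import math

def rmsnorm(x, eps=1e-5):
    # Rescale the vector so its root-mean-square is (approximately) 1.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = 1.0 / math.sqrt(ms + eps)
    return [xi * scale for xi in x]

x = [3.0, -4.0, 0.0, 5.0]
y = rmsnorm(x)
rms = math.sqrt(sum(yi * yi for yi in y) / len(y))
print(y, rms)  # rms is ~1.0 regardless of the input's scale
```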

Learning

The training loop repeats 1,000 times: pick a name, tokenize it, run the model forward over every position, compute the cross-entropy loss at each position, average the losses, backpropagate to get gradients for every parameter, and update the parameters to make the loss a bit lower.

The optimizer is Adam, which is smarter than naive gradient descent. It maintains a running average of each parameter's recent gradients (momentum) and a running average of the squared gradients (adaptive learning rate). Parameters that have been getting consistent gradients take larger steps. Parameters that have been oscillating take smaller ones.

Watch the loss decrease over 1,000 training steps. The model starts at ~3.3, the loss for random guessing among 27 tokens (-log(1/27) ≈ 3.3), and falls from there.

[interactive chart: loss falling from ~3.5 toward ~2.5 over the training steps, starting at the "random guessing" line; before training, the generated names are gibberish like "xqbzjf", "mwplkt", "gvrcnx"]

Step through the code for one complete training iteration. Watch it pick a name, run the forward pass at each position, compute the loss, run backward, and update the parameters:

```python
# Pick a document and tokenize it
doc = docs[step % len(docs)]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

# Forward pass: predict each next token
keys, values = [[] for _ in range(n_layer)], [...]
losses = []
for pos_id in range(n):
    token_id = tokens[pos_id]
    target_id = tokens[pos_id + 1]
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax(logits)
    loss_t = -probs[target_id].log()
    losses.append(loss_t)
loss = (1/n) * sum(losses)

# Backward pass
loss.backward()

# Adam optimizer update
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad**2
    m_hat = m[i] / (1 - beta1 ** (step+1))
    v_hat = v[i] / (1 - beta2 ** (step+1))
    p.data -= lr * m_hat / (v_hat**0.5 + eps)
    p.grad = 0
```

Making things up

Once training is done, Inference is straightforward. Start with BOS, run the forward pass, get 27 probabilities, randomly sample one token, feed it back in, and repeat until the model outputs BOS again (meaning "I'm done") or we hit the maximum length.

Temperature controls how we sample. Before softmax, we divide the logits by the temperature. A temperature of 1.0 samples directly from the learned distribution. Lower temperatures sharpen the distribution (the model picks its top choices more often). Higher temperatures flatten it (more diverse but potentially less coherent output).
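The scaling can be sketched in isolation with plain floats (the logit values here are made up; microgpt divides its Value logits the same way):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]
for temperature in (0.5, 1.0, 2.0):
    # Dividing logits by the temperature before softmax reshapes the distribution.
    probs = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 3) for p in probs])
# Low temperature sharpens the distribution; high temperature flattens it.
```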

Adjust the temperature and watch the probability distribution change:

[interactive chart: the model's original output distribution over a-z, with the top character around 28%]

Step through the inference loop to see a name being generated character by character. At each step, the model runs forward, produces probabilities, and samples the next token:

```python
temperature = 0.5
keys, values = [[] for _ in range(n_layer)], [...]
token_id = BOS
sample = []

for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)
    probs = softmax([l / temperature for l in logits])
    token_id = random.choices(
        range(vocab_size),
        weights=[p.data for p in probs])[0]
    if token_id == BOS:
        break
    sample.append(uchars[token_id])

print("".join(sample))
```

A temperature approaching 0 would always pick the highest-probability token (greedy decoding). This produces the most "average" output. A temperature of 1.0 matches what the model actually learned. Values above 1.0 inject extra randomness, which can produce creative outputs but also nonsense. The sweet spot for names is around 0.5.

Everything else is efficiency

This 200-line script contains the complete algorithm. Between this and ChatGPT, nothing changes conceptually. The differences are all engineering: trillions of tokens instead of 32,000 names. Subword tokenization (100K vocabulary) instead of characters. Tensors on GPUs instead of scalar Value objects in Python. Hundreds of billions of parameters instead of 4,192. Hundreds of layers instead of one. Training across thousands of GPUs for months.

But the core loop is the same: tokenize, embed, attend, compute, predict the next token, measure surprise, walk the gradients backward, nudge the parameters. Repeat.
