Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.
This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.
Maintained by Dimitris Papailiopoulos (@dimitrispapail).
We track two categories:
- Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
- Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.
Both are valid. Both are interesting.
**Hand-coded leaderboard**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 100% | alexlitz | — | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 2 | 40 | 100% | Wonderfall (@w0nderfall) | — | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O projections, RoPE period-19, parabolic tied-embed decode, two-hinge ReLU MLP | gist |
| 3 | 50 | 100% | lichengliu03 | — | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 4 | 66 | 100% | cosminscn | — | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 5 | 87 | 100% | bingbangboom-lab | — | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 6 | 93 | 100% | jacobli99 | — | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 7 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 8 | 116 | 100% | nino | — | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 9 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 10 | 130 | 100% | cosminscn | — | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 11 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 12 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 13 | 148 | 100% | bingbangboom-lab | — | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 14 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 15 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |
* Passed 8,192 random tests; not independently verified on our 10K test suite yet.
**Trained leaderboard**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 311 | 99.999% | rezabyt (@reza_byt) | — | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 2 | 335 | 99.92% | h3nock | — | 1L decoder, d=4, 1h, ff=12 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 3 | 456 | 100% | yinglunz | — | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 4 | 491 | 99.97% | rezabyt (@reza_byt) | — | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 5 | 512 | 99.988% | yinglunz (@yinglun122) | — | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 6 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 7 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 8 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
The model must operate as a genuine autoregressive transformer. This means:
- Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.
- The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. Carry propagation must emerge from this autoregressive process — not from explicit state variables passed between steps in Python.
- Standard forward pass. The model's `forward()` method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside `forward()`. The autoregressive generation loop lives outside the model, exactly as it would for any language model.
- The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic — manually pairing digits, threading carry state, indexing into specific positions — then the Python code is solving the problem, not the model.
In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
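A generic decoding loop of the kind described above might look like the following sketch (greedy decoding; `model`, the token ids, and `eos_id` are placeholders, not taken from any specific submission):

```python
import torch

def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=12):
    """Generic autoregressive decoding: works with any checkpoint and
    contains no addition-specific logic (no digit pairing, no carry state)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                # shape (1, seq_len)
        logits = model(x)                      # plain tensor-in, logits-out
        next_id = int(logits[0, -1].argmax())  # greedy pick of the next token
        ids.append(next_id)                    # feed the prediction back in
        if next_id == eos_id:
            break
    return ids
```

Swapping in different weights with the exact same loop is the litmus test: nothing here knows the task is addition.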
Explicitly allowed:
- Architectural variations: rank-1/low-rank projections, factorized embeddings, custom positional encodings, alternative norms
- Hand-coded weights (constructive proofs are valid — they show the architecture can represent addition)
- Trained weights via any generic learning algorithm (shows the solution is learnable — encourages creative ideas on data format, tokenization, and curriculum)
- Input formatting choices (reversed digits, delimiters, etc.) as long as the format is fixed and doesn't encode the answer
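As an illustration of an allowed formatting choice (this is a hypothetical format, not what any particular entry uses): reversing digit order lets the model emit the sum least-significant digit first, so each carry is computable before the digit that needs it.

```python
def format_example(a: int, b: int, width: int = 10) -> str:
    """Hypothetical fixed input format: zero-padded, digit-reversed
    operands and sum. Reversal is a formatting choice only; it does
    not encode the answer."""
    def rev(n: int, w: int) -> str:
        return str(n).zfill(w)[::-1]
    return f"{rev(a, width)}+{rev(b, width)}={rev(a + b, width + 1)}"
```

For example, `format_example(95, 17)` pairs '5'+'7' first, so the carry into the tens place is already known when the tens digit is generated.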
Rules:
- Must achieve >= 99% accuracy on 10,000 random test pairs (held-out, fixed seed)
- Inputs: two integers in [0, 9,999,999,999]
- Output: their sum as an integer
- Verified using `verify.py` with `--seed 2025`
Parameter counting:
- Count unique parameters (after weight tying/deduplication)
- Fixed/sinusoidal positional encodings are not counted (following the original Transformer paper convention)
- Learned positional encodings are counted
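A parameter count that respects the rules above can be computed roughly as follows (a sketch; `verify.py` may count differently):

```python
import torch

def count_unique_params(model: torch.nn.Module) -> int:
    """Count each tensor once, even when it is tied to multiple modules
    (e.g. embedding weights reused as the LM head). PyTorch's
    parameters() already deduplicates shared tensors; the id() set
    makes the intent explicit."""
    seen, total = set(), 0
    for p in model.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            total += p.numel()
    return total
```

Fixed sinusoidal encodings never appear in `parameters()`, so they are naturally excluded; learned positional embeddings are `nn.Parameter`s and are counted.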
Option A: Open an Issue (easiest)
- Click New Issue and fill in the template
- Include a link to your code (GitHub repo, gist, etc.)
- Include test results (accuracy on random pairs)
- We'll verify and add you to the leaderboard
Option B: Open a Pull Request
- Fork this repo
- Update the leaderboard in README.md with your entry
- Include verification results
- We'll review and merge
Updates to the leaderboard are welcome via pull request.
Run `python verify.py submissions/your_submission.py`. This runs:
- 10 edge cases (boundary values, max carry chains)
- 10,000 random pairs (seed=2025)
- Reports accuracy, pass/fail, and timing
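The random pairs are reproducible from the fixed seed, along these lines (a sketch of the idea; the exact sampling in `verify.py` may differ):

```python
import random

def make_test_pairs(n=10_000, seed=2025, hi=9_999_999_999):
    """Deterministic test set: the same seed always yields the same pairs."""
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]
```

Because `random.Random(seed)` is its own generator, the test set cannot be perturbed by any randomness inside the model code.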
This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?
Addition requires three capabilities:
- Digit alignment — pairing corresponding digits from two numbers
- Per-digit arithmetic — computing sum and carry for each pair
- Carry propagation — threading carry information across positions
Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
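For reference, the three capabilities map directly onto the schoolbook algorithm, shown here in plain Python; a transformer must realize the same steps in its weights rather than in code:

```python
def schoolbook_add(a_digits, b_digits):
    """Digit-wise addition, least-significant digit first.
    Assumes both digit lists are zero-padded to equal length."""
    out, carry = [], 0
    for x, y in zip(a_digits, b_digits):          # 1. digit alignment
        carry, digit = divmod(x + y + carry, 10)  # 2. per-digit sum and carry
        out.append(digit)                         # 3. carry threads to the next step
    out.append(carry)                             # possible final carry-out
    return out
```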
- Parameter cliff at ~800: Sharp accuracy transition observed by multiple researchers
- Single layers beat two layers at equivalent parameter budgets (for trained models)
- d=7 was the sweet spot for early trained models — multiple independent teams converged on this
- d=4 now works with rank-3 factorization + grokking (311 params trained)
- Hand-coded models can go much smaller (36 vs 311 trained) since they don't need to be discoverable by SGD
- Rank-3 factorization is the key trick for trained models
- ALiBi enables extreme compression: the 36-param leader uses ALiBi with slope log(10) for base-10 positional weighting, achieving 100% accuracy with a 2-layer decoder (d=5) in float64
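To see why slope log(10) gives base-10 weighting (a sketch of the general ALiBi mechanism, not the leader's actual code): ALiBi subtracts slope × distance from each attention score before the softmax, so with slope = ln(10) every extra position of distance multiplies a key's unnormalized attention weight by exp(-ln 10) = 1/10, mirroring decimal place value.

```python
import math

# ALiBi bias: score penalty grows linearly with key distance.
slope = math.log(10)
distances = [0, 1, 2, 3]
biases = [-slope * d for d in distances]

# Unnormalized softmax weights: exp(-slope * d) = 10**(-d),
# i.e. each farther digit position is down-weighted by a factor of 10.
weights = [math.exp(b) for b in biases]
```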
License: MIT
