Smallest transformer that can add two 10-digit numbers

Original link: https://github.com/anadim/AdderBoard

## Minimal Transformer Addition Challenge: Summary

This challenge aims to find the smallest Transformer model that can accurately add two 10-digit numbers (>= 99% accuracy on a held-out test set of 10,000 examples). The project started with Claude Code and Codex, and community contributions have since driven dramatic reductions in model size.

Two categories of models are tracked: **trained** models (weights learned by algorithms such as SGD) and **hand-coded** models (weights set analytically, as constructive proofs of what the architecture can represent). The current leaders are remarkably small: a hand-coded solution reaches 100% accuracy with only **36 parameters**, while a trained model using rank-3 factorization reaches 99.999% accuracy with **311 parameters**.

Key techniques behind these results include low-rank projections, factorized embeddings, custom positional encodings (such as ALiBi), and curriculum learning. The core constraint is a genuine autoregressive Transformer: self-attention is *required*, and carries must emerge from the model's generation process rather than from explicit code. The challenge also highlights a "parameter cliff" around 800 parameters and shows how techniques like rank-3 factorization make the solution learnable.


AdderBoard

Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.

This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.

Maintained by Dimitris Papailiopoulos (@dimitrispapail).

We track two categories:

  • Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
  • Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.

Both are valid. Both are interesting.

Hand-Coded Weights (Constructive Proofs)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 2 | 40 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O projections, RoPE period-19, parabolic tied-embed decode, two-hinge ReLU MLP | gist |
| 3 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 4 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 5 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 6 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 7 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 8 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 9 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 10 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 11 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 12 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 13 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 14 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 15 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |

* Passed 8,192 random tests; not independently verified on our 10K test suite yet.

Trained Weights (Learned from Data)

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 311 | 99.999% | rezabyt (@reza_byt) | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 2 | 335 | 99.92% | h3nock | | 1L decoder, d=4, 1h, ff=12 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 3 | 456 | 100% | yinglunz | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 4 | 491 | 99.97% | rezabyt (@reza_byt) | | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 5 | 512 | 99.988% | yinglunz (@yinglun122) | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 6 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 7 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 8 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |

The Core Constraint: Autoregressive Transformer

The model must operate as a genuine autoregressive transformer. This means:

  1. Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.

  2. The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. The carry propagation must emerge from this autoregressive process — not from explicit state variables passed between steps in Python.

  3. Standard forward pass. The model's forward() method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside forward(). The autoregressive generation loop lives outside the model, exactly as it would for any language model.

  4. The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic — manually pairing digits, threading carry state, indexing into specific positions — then the Python code is solving the problem, not the model.

In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
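A generic decoding loop of the kind described above might look like this. This is a minimal plain-Python sketch, not code from the repo: `model` is assumed to be any callable mapping a token-id list to per-position rows of logits, and nothing in the loop is addition-specific, so the same code works with any checkpoint and any task.

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=12):
    """Generic greedy autoregressive decoding: tokens in, next token out,
    fed back in. No digit pairing, no carry variables, no task logic."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                  # standard tokens-in, logits-out
        last = logits[-1]                    # logits for the next position
        next_id = max(range(len(last)), key=last.__getitem__)  # greedy pick
        ids.append(next_id)                  # feed the prediction back in
        if next_id == eos_id:
            break
    return ids
```

Swapping in a different set of weights means passing a different `model`; the loop itself never changes, which is exactly the legitimacy test stated above.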

What's allowed:

  • Architectural variations: rank-1/low-rank projections, factorized embeddings, custom positional encodings, alternative norms
  • Hand-coded weights (constructive proofs are valid; they show the architecture can represent addition)
  • Trained weights via any generic learning algorithm (shows the solution is learnable; encourages creative ideas on data format, tokenization, and curriculum)
  • Input formatting choices (reversed digits, delimiters, etc.) as long as the format is fixed and doesn't encode the answer

The task:

  • Must achieve >= 99% accuracy on 10,000 random test pairs (held-out, fixed seed)
  • Inputs: two integers in [0, 9,999,999,999]
  • Output: their sum as an integer
  • Verified using verify.py with --seed 2025

Parameter counting:

  • Count unique parameters (after weight tying/deduplication)
  • Fixed/sinusoidal positional encodings are not counted (following the original Transformer paper convention)
  • Learned positional encodings are counted
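The unique-parameter rule can be sketched in plain Python (a hypothetical helper, not the repo's counter): tied weights share one underlying tensor, so deduplicating by object identity counts each weight exactly once. With PyTorch you would iterate `model.named_parameters()` and sum `p.numel()` the same way.

```python
def count_unique_params(named_tensors):
    """Count parameters once per underlying tensor object: tied weights
    share the same object, so identity-based dedup implements the
    'count unique parameters after weight tying' rule."""
    seen, total = set(), 0
    for _name, t in named_tensors:
        if id(t) in seen:
            continue  # already counted via a tied reference
        seen.add(id(t))
        total += t.numel() if hasattr(t, "numel") else len(t)
    return total

# Example: an embedding tied to the output head is counted once.
embed = [0.0] * 10            # stands in for a 10-element weight tensor
params = [("embed", embed), ("lm_head", embed), ("ffn_w", [0.0] * 5)]
count_unique_params(params)   # -> 15, not 25
```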

Option A: Open an Issue (easiest)

  1. Click New Issue and fill in the template
  2. Include a link to your code (GitHub repo, gist, etc.)
  3. Include test results (accuracy on random pairs)
  4. We'll verify and add you to the leaderboard

Option B: Open a Pull Request

  1. Fork this repo
  2. Update the leaderboard in README.md with your entry
  3. Include verification results
  4. We'll review and merge

Updates to the leaderboard are welcome via pull request.

python verify.py submissions/your_submission.py

This runs:

  • 10 edge cases (boundary values, max carry chains)
  • 10,000 random pairs (seed=2025)
  • Reports accuracy, pass/fail, and timing
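The random-pair portion of the harness can be sketched as follows. This is an illustration of the protocol described above (seed 2025, 10,000 pairs in [0, 9,999,999,999]), not the actual `verify.py` source, whose sampling details may differ.

```python
import random

def make_test_pairs(n=10_000, seed=2025, hi=9_999_999_999):
    """Deterministically sample n pairs of 10-digit-or-smaller integers."""
    rng = random.Random(seed)  # fixed seed -> reproducible test set
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

def accuracy(predict, pairs):
    """Fraction of pairs where predict(a, b) returns the exact sum."""
    correct = sum(predict(a, b) == a + b for a, b in pairs)
    return correct / len(pairs)
```

A submission passes when `accuracy(model_predict, make_test_pairs())` is at least 0.99.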

This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?

Addition requires three capabilities:

  1. Digit alignment — pairing corresponding digits from two numbers
  2. Per-digit arithmetic — computing sum and carry for each pair
  3. Carry propagation — threading carry information across positions

Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
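The three capabilities can be made concrete with a plain-Python reference (a hypothetical helper, not code from the repo). Inside a transformer, step 1 maps to attention, step 2 to the MLP, and step 3 to autoregressive generation.

```python
def add_by_digits(a: int, b: int, n_digits: int = 10) -> int:
    """Grade-school addition decomposed into the three capabilities."""
    # 1. Digit alignment: pair corresponding digits, least significant first.
    da = [(a // 10**i) % 10 for i in range(n_digits)]
    db = [(b // 10**i) % 10 for i in range(n_digits)]
    out, carry = [], 0
    for x, y in zip(da, db):
        # 2. Per-digit arithmetic: sum and carry for each pair.
        s = x + y + carry
        out.append(s % 10)
        # 3. Carry propagation: thread the carry to the next position.
        carry = s // 10
    out.append(carry)  # possible final carry digit
    return int("".join(str(d) for d in reversed(out)))
```

The max-carry-chain edge case in the test suite stresses step 3: 9,999,999,999 + 1 requires the carry to ripple through all ten positions.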

Key Findings from the Community

  • Parameter cliff at ~800: Sharp accuracy transition observed by multiple researchers
  • Single layers beat two layers at equivalent parameter budgets (for trained models)
  • d=7 was the sweet spot for early trained models — multiple independent teams converged on this
  • d=4 now works with rank-3 factorization + grokking (311 params trained)
  • Hand-coded models can go much smaller (36 vs 311 trained) since they don't need to be discoverable by SGD
  • Rank-3 factorization is the key trick for trained models
  • ALiBi enables extreme compression: the 36-param leader uses ALiBi with slope log(10) for base-10 positional weighting, achieving 100% accuracy with a 2-layer decoder (d=5) in float64
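The ALiBi point can be checked numerically. This is a sketch of the mechanism, not the 36-parameter model's actual weights: ALiBi subtracts `slope * distance` from each attention score, so with equal content scores and slope log(10), the softmax weight of a position d steps away is proportional to exp(-log(10) * d) = 10**-d, i.e. exact base-10 positional weighting.

```python
import math

def alibi_weights(n_positions, slope=math.log(10)):
    """Softmax attention weights under an ALiBi bias, assuming equal
    content scores, so only the distance penalty matters."""
    scores = [-slope * d for d in range(n_positions)]  # bias = -slope * distance
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Each successive position gets exactly 1/10 the weight of the previous one,
# matching the place-value structure of base-10 digits.
```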

License: MIT
