Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.
This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.
Maintained by Dimitris Papailiopoulos (@dimitrispapail).
We track two categories:
- Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
- Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.
Both are valid. Both are interesting.
**Hand-coded leaderboard**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 100% | alexlitz | — | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 2 | 40 | 100% | Wonderfall (@w0nderfall) | — | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O projections, RoPE period-19, parabolic tied-embed decode, two-hinge ReLU MLP | gist |
| 3 | 50 | 100% | lichengliu03 | — | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 4 | 66 | 100% | cosminscn | — | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 5 | 87 | 100% | bingbangboom-lab | — | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 6 | 93 | 100% | jacobli99 | — | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 7 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 8 | 116 | 100% | nino | — | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 9 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 10 | 130 | 100% | cosminscn | — | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 11 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 12 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 13 | 148 | 100% | bingbangboom-lab | — | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 14 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 15 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |
* Passed 8,192 random tests; not independently verified on our 10K test suite yet.
**Trained leaderboard**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 311 | 99.999% | rezabyt (@reza_byt) | — | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 2 | 335 | 99.92% | h3nock | — | 1L decoder, d=4, 1h, ff=12 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 3 | 456 | 100% | yinglunz | — | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 4 | 491 | 99.97% | rezabyt (@reza_byt) | — | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 5 | 512 | 99.988% | yinglunz (@yinglun122) | — | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 6 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 7 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 8 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
The model must operate as a genuine autoregressive transformer. This means:
- Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.
- The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. Carry propagation must emerge from this autoregressive process — not from explicit state variables passed between steps in Python.
- Standard forward pass. The model's `forward()` method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside `forward()`. The autoregressive generation loop lives outside the model, exactly as it would for any language model.
- The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic — manually pairing digits, threading carry state, indexing into specific positions — then the Python code is solving the problem, not the model.
In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
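A generic decoding loop of the kind described above might look like the following sketch (greedy decoding; `model`, the token ids, and `eos_id` are placeholders, not taken from any specific submission):

```python
import torch

def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=12):
    """Generic autoregressive decoding: works with any checkpoint and
    contains no addition-specific logic (no digit pairing, no carry state)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                # shape (1, seq_len)
        logits = model(x)                      # plain tensor-in, logits-out
        next_id = int(logits[0, -1].argmax())  # greedy pick of the next token
        ids.append(next_id)                    # feed the prediction back in
        if next_id == eos_id:
            break
    return ids
```

Swapping in different weights with the exact same loop is the litmus test: nothing here knows the task is addition.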
Explicitly allowed:
- Architectural variations: rank-1/low-rank projections, factorized embeddings, custom positional encodings, alternative norms
- Hand-coded weights (constructive proofs are valid — they show the architecture can represent addition)
- Trained weights via any generic learning algorithm (shows the solution is learnable — encourages creative ideas on data format, tokenization, and curriculum)
- Input formatting choices (reversed digits, delimiters, etc.) as long as the format is fixed and doesn't encode the answer
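As an illustration of an allowed formatting choice (this is a hypothetical format, not what any particular entry uses): reversing digit order lets the model emit the sum least-significant digit first, so each carry is computable before the digit that needs it.

```python
def format_example(a: int, b: int, width: int = 10) -> str:
    """Hypothetical fixed input format: zero-padded, digit-reversed
    operands and sum. Reversal is a formatting choice only; it does
    not encode the answer."""
    def rev(n: int, w: int) -> str:
        return str(n).zfill(w)[::-1]
    return f"{rev(a, width)}+{rev(b, width)}={rev(a + b, width + 1)}"
```

For example, `format_example(95, 17)` pairs '5'+'7' first, so the carry into the tens place is already known when the tens digit is generated.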
Rules:
- Must achieve >= 99% accuracy on 10,000 random test pairs (held-out, fixed seed)
- Inputs: two integers in [0, 9,999,999,999]
- Output: their sum as an integer
- Verified using `verify.py` with `--seed 2025`
Parameter counting:
- Count unique parameters (after weight tying/deduplication)
- Fixed/sinusoidal positional encodings are not counted (following the original Transformer paper convention)
- Learned positional encodings are counted
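A parameter count that respects the rules above can be computed roughly as follows (a sketch; `verify.py` may count differently):

```python
import torch

def count_unique_params(model: torch.nn.Module) -> int:
    """Count each tensor once, even when it is tied to multiple modules
    (e.g. embedding weights reused as the LM head). PyTorch's
    parameters() already deduplicates shared tensors; the id() set
    makes the intent explicit."""
    seen, total = set(), 0
    for p in model.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            total += p.numel()
    return total
```

Fixed sinusoidal encodings never appear in `parameters()`, so they are naturally excluded; learned positional embeddings are `nn.Parameter`s and are counted.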
Option A: Open an Issue (easiest)
- Click New Issue and fill in the template
- Include a link to your code (GitHub repo, gist, etc.)
- Include test results (accuracy on random pairs)
- We'll verify and add you to the leaderboard
Option B: Open a Pull Request
- Fork this repo
- Update the leaderboard in README.md with your entry
- Include verification results
- We'll review and merge
Updates to the leaderboard are welcome via pull request.
Run `python verify.py submissions/your_submission.py`. This runs:
- 10 edge cases (boundary values, max carry chains)
- 10,000 random pairs (seed=2025)
- Reports accuracy, pass/fail, and timing
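The random pairs are reproducible from the fixed seed, along these lines (a sketch of the idea; the exact sampling in `verify.py` may differ):

```python
import random

def make_test_pairs(n=10_000, seed=2025, hi=9_999_999_999):
    """Deterministic test set: the same seed always yields the same pairs."""
    rng = random.Random(seed)  # isolated RNG, unaffected by global state
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]
```

Because `random.Random(seed)` is its own generator, the test set cannot be perturbed by any randomness inside the model code.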
This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?
Addition requires three capabilities:
- Digit alignment — pairing corresponding digits from two numbers
- Per-digit arithmetic — computing sum and carry for each pair
- Carry propagation — threading carry information across positions
Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
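For reference, the three capabilities map directly onto the schoolbook algorithm, shown here in plain Python; a transformer must realize the same steps in its weights rather than in code:

```python
def schoolbook_add(a_digits, b_digits):
    """Digit-wise addition, least-significant digit first.
    Assumes both digit lists are zero-padded to equal length."""
    out, carry = [], 0
    for x, y in zip(a_digits, b_digits):          # 1. digit alignment
        carry, digit = divmod(x + y + carry, 10)  # 2. per-digit sum and carry
        out.append(digit)                         # 3. carry threads to the next step
    out.append(carry)                             # possible final carry-out
    return out
```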
- Parameter cliff at ~800: Sharp accuracy transition observed by multiple researchers
- Single layers beat two layers at equivalent parameter budgets (for trained models)
- d=7 was the sweet spot for early trained models — multiple independent teams converged on this
- d=4 now works with rank-3 factorization + grokking (311 params trained)
- Hand-coded models can go much smaller (36 vs 311 trained) since they don't need to be discoverable by SGD
- Rank-3 factorization is the key trick for trained models
- ALiBi enables extreme compression: the 36-param leader uses ALiBi with slope log(10) for base-10 positional weighting, achieving 100% accuracy with a 2-layer decoder (d=5) in float64
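To see why slope log(10) gives base-10 weighting (a sketch of the general ALiBi mechanism, not the leader's actual code): ALiBi subtracts slope × distance from each attention score before the softmax, so with slope = ln(10) every extra position of distance multiplies a key's unnormalized attention weight by exp(-ln 10) = 1/10, mirroring decimal place value.

```python
import math

# ALiBi bias: score penalty grows linearly with key distance.
slope = math.log(10)
distances = [0, 1, 2, 3]
biases = [-slope * d for d in distances]

# Unnormalized softmax weights: exp(-slope * d) = 10**(-d),
# i.e. each farther digit position is down-weighted by a factor of 10.
weights = [math.exp(b) for b in biases]
```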
License: MIT
