NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

Original link: https://qlabs.sh/slowrun

## NanoGPT Slowrun: Improving the Data Efficiency of AI

Q Labs' "NanoGPT Slowrun" is an open-source project focused on developing data-efficient learning algorithms. Recognizing that data, rather than compute, will become the main bottleneck for AI progress (contrary to the current trend in language models), the project challenges researchers to achieve the lowest validation loss on a fixed 100M-token dataset using *unlimited* compute. This inverts typical "speedrun" benchmarks and allows exploration of computationally expensive but potentially impactful techniques, such as heavy regularization and alternative optimizers. Early results are encouraging: community contributions have already reached **5.5x data efficiency** against the modded-nanogpt baseline, up from an initial 2.4x. Key improvements include per-epoch shuffling, learned projections, and an activation-function swap. The project's short-term goal is **10x data efficiency, with a possible 100x by the end of the year**, focusing on areas such as second-order optimizers, diffusion models, and curriculum learning. Q Labs encourages contributions and collaboration through its open repository.


Original article
NanoGPT Slowrun - Q

March 2026

NanoGPT Slowrun is an open effort to implement data-efficient learning algorithms; 5.5x data efficiency in the first week and improving.

Compute grows much faster than data. Our current scaling laws require proportional increases in both to scale. But the asymmetry in their growth means intelligence will eventually be bottlenecked by data, not compute. This is easy to see if you look at almost anything other than language models. In robotics and biology, massive data requirements lead to weak models, and both fields have strong enough economic incentives to spend 1000x more compute if that led to significantly better results. But they can't, because nobody knows how to scale with compute alone, without adding more data. The solution is to build new learning algorithms that work in limited-data, practically-infinite-compute settings. This is what we are solving at Q Labs: our goal is to understand and solve generalization.
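To make the data bottleneck concrete, here is an illustration we are adding (it is not from the post) using the Chinchilla-style parametric loss of Hoffmann et al. (2022): once the token count D is pinned, scaling parameters N with unbounded compute only drives the model term to zero, leaving a loss floor set by the data term.

```latex
% Chinchilla-style parametric loss (Hoffmann et al., 2022); E, A, B,
% \alpha, \beta are fitted constants, cited here purely for illustration.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% With D fixed (e.g., 100M tokens) and N unbounded, loss floors at:
\lim_{N \to \infty} L(N, D) = E + \frac{B}{D^{\beta}}
```

Under current scaling laws, getting below that floor requires more data, which is why new learning algorithms are the lever to pull in this regime.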

[Figure: NanoGPT Slowrun baseline on 100M tokens (2.4x data efficiency)]

Last week we released NanoGPT Slowrun, an open repo for data-efficient learning algorithms. The rules are simple: train on 100M tokens from FineWeb, use as much compute as you want, and the lowest validation loss wins. Improvements are submitted as PRs to the repo and merged if they lower val loss. The constraint is the inverse of speedruns like modded-nanogpt, which optimize wall-clock time. Those benchmarks have been hugely productive, but optimizing for speed filters out expensive ideas: heavy regularization, second-order optimizers, gradient-descent alternatives. Slowrun is built for exactly those ideas.
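As a concrete picture of the setup, here is a minimal PyTorch sketch of the constraint; the file names, uint16 token format, and batch shapes are our assumptions for illustration, not the repo's actual layout.

```python
# Minimal sketch of the Slowrun constraint (illustrative only: file names,
# token dtype, and batch shapes are assumptions, not the repo's layout).
import numpy as np
import torch
import torch.nn.functional as F

TRAIN_BUDGET = 100_000_000  # the fixed data budget: 100M FineWeb tokens

tokens = np.memmap("fineweb_train.bin", dtype=np.uint16, mode="r")
train = tokens[:TRAIN_BUDGET]  # training may never see anything beyond this
val = np.memmap("fineweb_val.bin", dtype=np.uint16, mode="r")

def get_batch(data, batch_size=32, block_size=1024):
    # Sample (input, target) windows; compute is unbounded, so you may
    # loop over these 100M tokens for as many epochs as you like.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y

@torch.no_grad()
def score(model, eval_iters=200):
    # The only number that counts: cross-entropy on held-out FineWeb.
    model.eval()
    losses = []
    for _ in range(eval_iters):
        x, y = get_batch(val)
        logits = model(x)  # assumes the model returns raw logits
        losses.append(F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)).item())
    return sum(losses) / len(losses)
```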

What we've found so far

Muon outperforms every optimizer we tested (AdamW, SOAP, MAGMA). Multi-epoch training matters. And following work by Kotha et al., scaling to large parameter counts works if you pair it with aggressive regularization -- weight decay up to 16x the standard value, plus dropout. The baseline sits at ~2.4x data efficiency against modded-nanogpt.
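For illustration, the regularization finding might look like the following in a standard PyTorch training setup; the 0.1 baseline weight decay, learning rate, and dropout value are assumptions on our part, and Muon itself is omitted since its exact API varies by implementation.

```python
# Illustrative rendering of "aggressive regularization"; the baseline
# weight decay, lr, and dropout are assumptions, and make_model is a
# hypothetical constructor standing in for the repo's model builder.
import torch

BASE_WEIGHT_DECAY = 0.1  # a common AdamW default in GPT training

model = make_model(dropout=0.1)  # hypothetical: a GPT with dropout enabled

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=16 * BASE_WEIGHT_DECAY,  # "up to 16x standard"
)
```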

Update: 5.5x Data Efficiency

Since the initial release, community contributions have pushed data efficiency from ~2.4x to 5.5x against modded-nanogpt, more than doubling in a few days. The key changes are: shuffling at the start of each epoch, which had an outsized impact on multi-epoch training; learned projections for value embeddings instead of separate embedding tables; swapping squared ReLU for SwiGLU activations; and ensembling multiple models. 10x data efficiency seems reachable in the short term. 100x might be feasible by the end of the year, given how many directions remain unexplored, but it will require serious exploration on the algorithms side.
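Two of these changes are easy to sketch. Below is a minimal PyTorch rendering of the SwiGLU swap and per-epoch shuffling; dimensions, seeding, and module boundaries are illustrative rather than the repo's exact code.

```python
# Minimal sketch of two of the merged changes (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Replaces the squared-ReLU MLP: out = W3(silu(W1 x) * W2 x)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w2 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w3 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

def epoch_order(num_sequences, epoch, seed=0):
    # Per-epoch shuffling: draw a fresh permutation of the fixed training
    # sequences at the start of every epoch, instead of replaying one order.
    g = torch.Generator().manual_seed(seed + epoch)
    return torch.randperm(num_sequences, generator=g)
```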

[Figure: Updated Slowrun results (5.5x data efficiency)]

Directions we think are wide open

  • Second-order optimizers and natural gradient methods
  • Diffusion models
  • Curriculum learning
  • Gradient descent alternatives like evolutionary search (see the sketch after this list)
  • Optimizing for compression/model-complexity
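As one possible reading of the evolutionary-search direction, here is a toy mutation-and-select step over model weights; the population size, mutation scale, and fitness function are placeholders, not a recipe from the post.

```python
# Toy (1, lambda) evolutionary-search step over model weights; all
# hyperparameters here are placeholder choices for illustration.
import copy
import torch

def es_step(model, fitness, pop_size=8, sigma=0.01):
    """fitness(model) -> scalar, lower is better (e.g., validation loss)."""
    best_loss, best_child = float("inf"), None
    for _ in range(pop_size):
        child = copy.deepcopy(model)
        with torch.no_grad():
            for p in child.parameters():
                p.add_(sigma * torch.randn_like(p))  # Gaussian mutation
        loss = fitness(child)
        if loss < best_loss:
            best_loss, best_child = loss, child
    return best_child  # keep only the best-scoring mutant
```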

If you're working on any of this or something we haven't thought of, open an issue on the repo, or email [email protected].

