NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute

Original link: https://qlabs.sh/10x

## NanoGPT Slowrun: Achieving 10x Data Efficiency

A recent NanoGPT Slowrun experiment demonstrates **10x data efficiency**: using 100M tokens to match results that would normally require 1B tokens, via an ensemble of 1.8B-parameter models (18B parameters total). This matters because scaling intelligence is increasingly limited by the availability of *data*, not *compute*.

The result departs sharply from established scaling laws such as Chinchilla. The key techniques driving the efficiency are: **ensemble training**, training multiple models and averaging their predictions; **chain distillation**, training models sequentially so each distills knowledge from its predecessor; aggressive **regularization** (weight decay 16x the standard value); and **looped transformers**, which let each prediction use more compute. Several **architectural tweaks**, such as exclusive self-attention and U-Net style skip connections, also contributed gains, and the team sees systematic architecture search as a crucial future direction.

The current goal is **100x data efficiency** within a year, which will require further innovation but looks feasible given the pace of progress. The work highlights the potential to improve model performance by scaling compute rather than being constrained by data.

## NanoGPT Slowrun: Data Efficiency with Infinite Compute - Discussion Summary

A recent Hacker News discussion of the NanoGPT Slowrun project explored data efficiency in large language model (LLM) training. The core question is what can be achieved with far less data when abundant compute is available.

The conversation highlighted a shift in thinking about data efficiency: rather than focusing only on reducing parameters, the emphasis is on extracting maximum information from a limited dataset. Participants compared AI and human learning, noting the vast gap between training-data volumes and the evolutionary "pretraining" that humans receive.

One key point was that while synthetic data generation is becoming common, it does not guarantee improvement, and new training methods remain essential. The discussion also touched on whether LLMs could "learn to learn," and on the potential benefits of mimicking biological learning processes, such as sleep for memory consolidation. Ultimately, the slowrun aims to push the limits of pretraining techniques, especially under data scarcity, and to explore whether new ideas can outperform simply scaling up synthetic data generation.

## Original Article

We've achieved 10x data efficiency with NanoGPT Slowrun within a few weeks. An ensemble of 1.8B parameter models (18B total params) trained on 100M tokens matches what would normally require 1B tokens with a standard LM baseline. Data efficiency matters because compute grows much faster than data. Since our current scaling laws require proportional increases in both, intelligence will eventually be bottlenecked by data, not compute. This data efficiency result allows us to improve model performance by scaling with compute rather than with data.

Figure: NanoGPT Slowrun, 3.8× data efficiency

A few things worth noting. First, this looks nothing like our current scaling laws. Chinchilla says you should train a ~5M parameter model if you have 100M tokens -- a staggering 3600x difference from our 18B total parameters. Second, 10x data efficiency would've seemed unimaginable to most people, and we got there in ... a few weeks. Here's how. Some of the gains are architectural tweaks without a lot of principles behind them. But a few are principled, and we believe they will transfer to larger scales. Those are what matter fundamentally.
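As a sanity check on those numbers, a quick back-of-the-envelope calculation using the common ~20 tokens-per-parameter rule of thumb for Chinchilla-optimal training (the rule of thumb is an approximation, not the paper's exact fit):

```python
# Chinchilla rule of thumb: train on roughly 20 tokens per parameter.
tokens = 100_000_000              # the 100M-token dataset
chinchilla_params = tokens / 20   # ~5M parameters would be "optimal"

ensemble_params = 18_000_000_000  # 10 models x 1.8B params = 18B total
overparam_factor = ensemble_params / chinchilla_params

print(f"Chinchilla-optimal size: {chinchilla_params / 1e6:.0f}M params")
print(f"Overparameterization factor: {overparam_factor:.0f}x")  # 3600x
```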

## Ensemble

Ensembling is probably the most understudied axis of scaling in pretraining. Instead of training one model, you train many models somewhat independently and aggregate their predictions at inference. This way, you can keep leveraging more compute under fixed data and keep improving generalization.
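The aggregation step can be sketched minimally as follows; this assumes models that return logit vectors and averages in logit space (as the chain-distillation algorithm later in the post does). The callables here are toy stand-ins, not the repo's actual API:

```python
import numpy as np

def ensemble_logits(models, x):
    """Average the logits of several independently trained models.

    `models` is a list of callables mapping an input to a logit vector;
    illustrative sketch only.
    """
    return np.mean([m(x) for m in models], axis=0)

# Toy example: three "models" that each output a 4-way logit vector.
models = [lambda x, b=b: x + b
          for b in (np.zeros(4), np.ones(4), np.full(4, 2.0))]
x = np.array([0.5, -1.0, 2.0, 0.0])
avg = ensemble_logits(models, x)  # elementwise mean of the three outputs
```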

Training dynamics for ensembles are very different than for a single model. This is a key insight. Pandey et al. show that post-hoc transforms like ensembling reverse the usual overfitting dynamics: while base models overfit with more training, ensembles favor base models trained for more epochs. Kim et al. independently find that ensembling allows for much longer training than a single model.

We see exactly this. In PR #26, we extended training from 12 to 18 epochs. Individual model loss went from 3.295 to 3.310 -- it got worse. But ensemble loss dropped from 3.185 to 3.166. The models learn different things when you push them past their individual optimum, and that helps the ensemble.

Chain distillation. We've found that chain knowledge distillation dramatically improves ensemble training (PR #31). The idea, inspired by Born-Again Neural Networks, is to train models sequentially, where each new model distills from the immediately preceding one:

Algorithm: Chain Distillation Ensemble

1. Train model M_1 on data D with standard cross-entropy loss.
2. For k = 2, ..., K:
   a. Load M_{k-1} as a frozen teacher.
   b. Train model M_k from scratch on D with loss
      L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T),
      where α = 0.5 and T = 1.0.
   c. Discard the teacher from memory.
3. At inference, ensemble all K models by averaging logits.

Note that only the immediately preceding model serves as teacher, not the full ensemble of prior models. This keeps memory constant and training fast. With 8 models trained this way in the chain distillation PR, individual model loss plateaus around 3.20, but ensemble loss hits 3.126 -- taking us from 7x to 8x data efficiency.
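The combined loss in step 2b can be sketched in a few lines (a toy numpy version; the actual training code is presumably a deep-learning framework, and with T = 1.0 the temperature terms drop out). Note the algorithm writes the KL as KL(student ‖ teacher), which is implemented literally here; classic Hinton-style distillation uses KL(teacher ‖ student) instead:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def chain_distill_loss(student_logits, teacher_logits, target, alpha=0.5, T=1.0):
    """L = (1 - alpha) * CE(student, target)
         + alpha * T^2 * KL(student/T || teacher/T)

    `target` is a hard-label class index. Toy sketch of step 2b.
    """
    p_s = softmax(student_logits / T)
    p_t = softmax(teacher_logits / T)
    ce = -np.log(p_s[target])                       # cross-entropy, hard label
    kl = np.sum(p_s * (np.log(p_s) - np.log(p_t)))  # KL as written in the algorithm
    return (1 - alpha) * ce + alpha * T**2 * kl
```

With alpha = 1 and identical student/teacher logits the loss is zero; with alpha = 0 it reduces to plain cross-entropy.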

There's a ton of headroom here in scaling ensembles further.

Figure: Ensemble val loss with chain distillation as models are added (x-axis: number of models)

## Regularization

Our theory is that generalization is closely related to compression -- in other words, simplicity. Regularization is a proxy for simplicity, particularly the techniques we've found most useful: L2 weight decay and dropout. It's no surprise regularization improves generalization. But the degree to which we can regularize is what's interesting.

We use weight decay up to 1.6 and dropout of 0.1. For context, standard practice is weight decay of ~0.1; ours is 16x that. And it works because we're massively overparameterized: a 1.8B model (our initial baseline was 2.7B) on 100M tokens, when Chinchilla says you should use ~5M parameters for that much data. Kim et al. find optimal weight decay is up to 30x larger than standard practice in the data-constrained regime, and we've confirmed this aggressively. And the larger the model you train, the more regularization you need.
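For illustration, decoupled (AdamW-style) weight decay applies the decay directly to the weights each step, separately from the gradient update. A minimal numpy sketch on plain SGD; the 1.6 value is the article's, everything else (learning rate, optimizer) is illustrative:

```python
import numpy as np

def sgd_step_decoupled_wd(w, grad, lr=0.01, weight_decay=1.6):
    """One update with decoupled weight decay (AdamW-style), here on plain SGD.

    The term lr * weight_decay * w shrinks weights toward zero independently
    of the loss gradient. weight_decay=1.6 is the article's aggressive
    setting, ~16x the common default of 0.1.
    """
    w = w - lr * grad                # gradient step
    w = w - lr * weight_decay * w    # decoupled decay step
    return w

w = np.array([1.0, -2.0])
w_next = sgd_step_decoupled_wd(w, grad=np.zeros(2))
# With zero gradient, each step shrinks weights by (1 - lr * wd) = 0.984
```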

## Looping

Looped transformers have better inductive biases than standard transformers because they allow the model to apply more compute per prediction. Instead of a single forward pass through the layers, the model iterates, refining its representations.

We start by training our 30 layer transformer without looping, and then halfway through training we loop layers 15-24 four times. This means we first run layers 0-24 of the transformer, then re-run layers 15-24 4 times, and finally run layers 25-29. This configuration was found to be optimal: it is important not to loop the last few layers. There remains more work in extending and formalizing these heuristics.
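The schedule above can be sketched as a forward pass over a list of layer callables (layer indices and loop count from the article; the layer objects here are stand-ins for transformer blocks):

```python
def looped_forward(layers, x, loop_start=15, loop_end=24, loops=4):
    """Run a layer stack with a mid-network loop, per the article's schedule:
    layers 0-24 once, then layers 15-24 re-run `loops` more times,
    then layers 25-29. Layers are arbitrary callables."""
    for layer in layers[: loop_end + 1]:              # layers 0-24
        x = layer(x)
    for _ in range(loops):                            # re-run 15-24, 4 times
        for layer in layers[loop_start : loop_end + 1]:
            x = layer(x)
    for layer in layers[loop_end + 1 :]:              # layers 25-29
        x = layer(x)
    return x

# Count how often each layer runs in a 30-layer stack.
calls = [0] * 30
layers = [lambda x, i=i: (calls.__setitem__(i, calls[i] + 1), x)[1]
          for i in range(30)]
looped_forward(layers, x=0)
# Under this schedule, layers 15-24 run 5 times each; all others run once.
```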

Figure: Single model val loss as loop count increases (x-axis: number of loops)

## Architectural Changes

We've found some really good architectural changes, and the meta-pattern is that neural architecture search matters for data efficiency.

- **Exclusive Self Attention (XSA)** removes the self-value projection from the attention output (PR #36).
- **EMA** (exponential moving average of model weights), combined with weight decay tuning and several other changes -- half-truncated RoPE, partial key offset for single-layer induction heads, tuned residual lambdas -- gave a nice bump (PR #29).
- **U-Net skip connections** between mirrored transformer layers (layers 0-14 feeding into layers 29-15 via learned scalar weights) helped (PR #17).
- **SwiGLU** activation replacing squared ReLU (PR #12).
- **Value embeddings** via a learned projection from input embeddings, replacing separate embedding tables (PR #11).
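As one concrete example, the mirrored U-Net skips might be wired roughly like this: cache each early layer's output and add it, scaled by a learned scalar, into the mirrored late layer's input. This is a sketch under that assumption; everything beyond the 0-14 → 29-15 mirroring described in the post is a guess:

```python
MIRROR = {29 - i: i for i in range(15)}  # late layer 29..15 <- early layer 0..14

def forward_with_unet_skips(layers, x, skip_scales):
    """Mirrored U-Net skips over a 30-layer stack: the output of early layer i
    is cached and added, scaled by skip_scales[29 - i], to the input of late
    layer 29 - i. `layers` are callables; sketch only, real wiring may differ."""
    cache = {}
    for i, layer in enumerate(layers):
        if i in MIRROR:                      # late half: add mirrored activation
            x = x + skip_scales[i] * cache[MIRROR[i]]
        x = layer(x)
        if i < 15:                           # early half: cache output
            cache[i] = x
    return x

# With identity layers and unit scales, each of the 15 skips adds the
# cached value once on top of the input.
layers = [lambda x: x for _ in range(30)]
scales = {j: 1.0 for j in range(15, 30)}
out = forward_with_unet_skips(layers, 1.0, scales)
```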

Overall, these architectural tweaks keep giving significant data efficiency gains. It suggests that systematic architecture search is an important direction.

## What's Next

100x. It will probably require a few new breakthroughs, but seems feasible within a year.

## Contributors

@ChinmayK0607 · @not-nonymous · @shmublu · @zhiweixx · @em-see-squared · @ms337 · @kvegesna · @akshayvegesna

