Chess engines do weird stuff

Original link: https://girl.surgery/chess

## What chess engine training can teach LLM builders

Modern chess engines such as Leela Chess Zero (Lc0) offer valuable lessons for training large language models (LLMs). Originally, engines were trained with reinforcement learning (RL) through self-play. It turns out, however, that RL is really only needed *once*: to create a strong initial model. Subsequent engines can distill knowledge from that model *plus* its search algorithm, bypassing expensive game generation. A weak model with strong search beats a strong model without search, which makes search the key ingredient. Lc0 found that further RL actually *reduced* performance, highlighting the power of distilling from search. This differs from LLM best-of-n sampling, whose gains are limited compared with the substantial advantage chess search provides. Further improvements come from techniques such as SPSA: randomly perturbing model weights and keeping the changes that win more games, even *without* gradients. Although computationally expensive, SPSA delivers significant gains, and the principle extends beyond weights to any parameter in the engine's code. Finally, Lc0's shift to a transformer architecture and a novel attention-bias system ("smolgen") shows both how broadly applicable these architectures are and how much specialized components can improve performance.


Original article

Things LLM people can learn from chess engines

Training method

Since AlphaZero, lc0-style chess engines have been trained with RL. Specifically, you have the engine (search + model) play itself a bunch of times, and train the model to predict the outcome of the game.
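For intuition, here is a toy version of that loop in Python. It uses Nim instead of chess so the sketch stays self-contained; `TableModel`, `search`, and `self_play_game` are made-up stand-ins, not lc0's actual training code, which involves MCTS, policy targets, and distributed game generation.

```python
import random

class TableModel:
    """Toy evaluator: maps a state to an estimated outcome for the side to move."""
    def __init__(self):
        self.values = {}   # state -> running outcome estimate
        self.counts = {}

    def evaluate(self, state):
        return self.values.get(state, 0.0)

    def update(self, state, outcome):
        n = self.counts.get(state, 0) + 1
        self.counts[state] = n
        self.values[state] += (outcome - self.values.get(state, 0.0)) / n if state in self.values else 0.0
        self.values.setdefault(state, outcome)

def search(model, stones):
    """One-ply 'search': pick the move that leaves the opponent in the worst spot."""
    moves = [m for m in (1, 2, 3) if m <= stones]
    if random.random() < 0.1:                      # a little exploration so games differ
        return random.choice(moves)
    return min(moves, key=lambda m: model.evaluate(stones - m))

def self_play_game(model, stones=15):
    """Engine (search + model) plays itself; returns (state, outcome) training pairs."""
    history, player = [], 0
    while stones > 0:
        history.append((stones, player))
        stones -= search(model, stones)
        player ^= 1
    winner = player ^ 1                            # whoever took the last stone wins
    return [(s, 1.0 if p == winner else -1.0) for s, p in history]

model = TableModel()
for _ in range(2000):
    for state, outcome in self_play_game(model):   # play the engine against itself...
        model.update(state, outcome)               # ...and train it to predict the result
```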

It turns out this isn't necessary. Good model vs bad model is ~200 elo, but search is ~1200 elo, so even a bad model + search is essentially an oracle compared to a good model without search, and you can distill from bad model + search → good model.
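A compressed sketch of that distillation step, under heavy simplifying assumptions: the "weak model" is a noisy toy value function, "search" is just averaging many of its noisy evaluations (standing in for the way deep search sharpens a weak eval), and the "good model" is a linear fit to those search targets. None of this is lc0's pipeline; it only illustrates that the training targets come from weak model + search rather than from fresh self-play.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_WEIGHTS = np.array([0.5, -0.3, 0.2])            # hidden 'true' evaluation (toy)

def weak_model(features):
    """Weak evaluator: the true value plus a lot of noise."""
    return features @ TRUE_WEIGHTS + rng.normal(scale=0.5, size=len(features))

def search_eval(features, nodes=50):
    """Stand-in for search: many weak evaluations averaged into a sharper one."""
    return np.mean([weak_model(features) for _ in range(nodes)], axis=0)

# Label positions with weak model + search; no games are played or generated.
positions = rng.normal(size=(10_000, 3))             # toy position features
targets = search_eval(positions)

# 'Distill' into a new model by ordinary supervised fitting on those targets.
student_weights, *_ = np.linalg.lstsq(positions, targets, rcond=None)
print(student_weights)                               # recovers roughly TRUE_WEIGHTS
```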

So RL was necessary in some sense only one time. Once a good model with search was trained, every future engine (including their competitors!)1 can distill from that, and doesn't have to generate games (expensive). lc0 trained their premier model, BT4, with distillation and it got worse when you put it in the RL loop.

What makes distillation from search so powerful? People often compare this to distilling from best-of-n in RL, which I think is limited — a chess engine that runs the model on 50 positions is roughly equivalent to a model 30x larger, whereas LLM best-of-50 is generously worth a model 2x larger. Perhaps this was why people wanted test-time search to work so badly when RLVR was right under their noses.

Training at runtime

A recent technique is applying the distillation trick at runtime. At runtime, you evaluate early positions with your NN, then search them and get a more accurate picture. If your network says the position is +0.15 pawns better than search says, subtract 0.15 pawns from future evaluations. Your network live adapts to the position it's in!
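A minimal sketch of that correction, assuming a helper class I'm calling `EvalCorrector` (not lc0's name or API): it tracks a running average of how much the raw network evaluation exceeds the deeper search result, and subtracts that bias from later network evaluations.

```python
class EvalCorrector:
    """Tracks the running gap between raw network evals and search results."""
    def __init__(self):
        self.bias = 0.0      # running estimate of (network eval - search eval), in pawns
        self.n = 0

    def observe(self, network_eval, search_result):
        """Update the bias after search has double-checked the network on a position."""
        self.n += 1
        self.bias += ((network_eval - search_result) - self.bias) / self.n

    def corrected(self, network_eval):
        """Adjust a fresh network evaluation by the bias seen so far in this game."""
        return network_eval - self.bias


corrector = EvalCorrector()
corrector.observe(network_eval=0.40, search_result=0.25)   # net was +0.15 too optimistic here
print(corrector.corrected(0.10))                            # later raw evals get 0.15 subtracted
```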

Training on winning

The fundamental training objective of distilling from search is almost but not quite what we actually care about: winning. It's very correlated, but we don't actually care about how well the model estimates one position, we care about how well it performs after search, after looking at 100 positions.

To fix this, lc0 uses a weird technique called SPSA: you randomly perturb the weights in two directions, play a bunch of games, and go the direction that wins more.2 This works very well and can get +50 elo on small models.3
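A toy SPSA loop in Python, with the match results simulated from a hidden win-probability function (in real engine tuning this step is thousands of actual games): perturb the weights by a random sign vector in both directions, see which side of the perturbation wins the match, and step that way. All names here are illustrative.

```python
import random

OPTIMAL = [0.3, -1.2, 0.7]                     # hidden best weights (toy ground truth)

def win_probability(weights):
    """Toy fitness: the closer to OPTIMAL, the more likely to win a game."""
    distance = sum((w - o) ** 2 for w, o in zip(weights, OPTIMAL))
    return 1.0 / (1.0 + distance)

def play_match(weights_a, weights_b, games=200):
    """Simulate a match between two weight settings; returns net wins for A."""
    p_a, p_b = win_probability(weights_a), win_probability(weights_b)
    return sum(1 if random.random() < p_a / (p_a + p_b) else -1 for _ in range(games))

weights = [0.0, 0.0, 0.0]
delta, step = 0.1, 0.05
for _ in range(300):
    direction = [random.choice((-1, 1)) for _ in weights]        # random +/- per weight
    plus  = [w + delta * d for w, d in zip(weights, direction)]
    minus = [w - delta * d for w, d in zip(weights, direction)]
    net = play_match(plus, minus)                                # which perturbation wins more?
    sign = 1 if net > 0 else -1
    weights = [w + step * sign * d for w, d in zip(weights, direction)]

print(weights)   # drifts toward OPTIMAL with no gradient anywhere
```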

Consider for a moment how insane it is that this works at all. You're modifying the weights in purely random directions. You have no gradient whatsoever. And yet it works quite well! +50 elo is ~1.5x model size or ~a year's worth of development effort!

The main issue with this is that it's wildly expensive. To do a single step you must play thousands of games with dozens of moves and hundreds of position inferences per move.

Like LLMs, you train for a long time on a pseudo-objective that's close to what you want, then a short time on a very expensive and limited objective that's closer to what you want.

Tuning through C++

The underlying technique of SPSA can be applied to literally any number in your chess program. Modify the number, see if it wins more or loses more, move in the direction that wins more. Say you have a hand-tuned heuristic that if search finds a checkmate from a position, you should back off by depth 1. Replace that constant with thousandths of a depth and tune it with SPSA; it turns out the optimal value is actually to back off by depth 1.09, which nets you 5 elo. You can do this for every number in your search algorithm. You can do something that looks a lot like gradient descent through arbitrary C++, because you have a grading function (winning).
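The same perturb-and-play loop, pointed at one constant in the search code instead of the network weights. Everything here is illustrative: `CHECKMATE_BACKOFF` is a stand-in for the hand-tuned heuristic above, and the simulated match simply favors values near the 1.09 from the example, where the real procedure would favor whatever actually wins games.

```python
import random

TRUE_BEST = 1.09                                   # pretend optimum, taken from the example above

def match_net_wins(a, b, games=500):
    """Toy match: the backoff value closer to TRUE_BEST wins slightly more often."""
    edge = abs(b - TRUE_BEST) - abs(a - TRUE_BEST)
    p_a = 0.5 + max(-0.05, min(0.05, edge))        # small, capped winning edge for A
    return sum(1 if random.random() < p_a else -1 for _ in range(games))

CHECKMATE_BACKOFF = 1.0                            # the old hand-tuned "back off by depth 1"
delta, step = 0.02, 0.01
for _ in range(500):
    sign = random.choice((-1, 1))
    net = match_net_wins(CHECKMATE_BACKOFF + sign * delta,
                         CHECKMATE_BACKOFF - sign * delta)
    CHECKMATE_BACKOFF += step * sign * (1 if net > 0 else -1)

print(CHECKMATE_BACKOFF)                           # wanders up toward roughly 1.09
```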

Weird architecture

lc0 uses a standard-ish transformer architecture, which they found to be hundreds of elo better than their old convolution-based models. It's still confusing to me that the transformer's inductive biases apply to literally every application imaginable. The only substantial architectural change they use is "smolgen", a system for generating attention biases. They claim smolgen is a ~1.2x throughput hit but an accuracy win equivalent to 2.5x model size. Why is it so good? I find all the explanations poor.
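The post doesn't say how smolgen works internally, so the following is only a generic guess at the general shape of "a module that generates attention biases from the input": a compressed view of the position is expanded into a full seq_len x seq_len bias matrix and added to the attention logits before the softmax. Dimensions, projections, and the bias generator here are all invented for illustration; this is not smolgen.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 64, 128, 32           # 64 tokens, one per square, is the natural choice

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head projections plus a small generator for the attention bias.
W_q = rng.normal(scale=0.02, size=(d_model, d_head))
W_k = rng.normal(scale=0.02, size=(d_model, d_head))
W_v = rng.normal(scale=0.02, size=(d_model, d_head))
W_gen = rng.normal(scale=0.02, size=(seq_len * 8, seq_len * seq_len))   # compress -> expand

def attention_with_generated_bias(x):
    """Standard attention, plus an input-dependent bias on the attention logits."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    logits = (q @ k.T) / np.sqrt(d_head)                     # (seq_len, seq_len)

    compressed = x[:, :8].reshape(-1)                        # crude squeeze of the whole position
    bias = (compressed @ W_gen).reshape(seq_len, seq_len)    # full bias matrix from the position

    return softmax(logits + bias) @ v                        # bias reshapes who attends to whom

x = rng.normal(size=(seq_len, d_model))                      # one toy board encoding
print(attention_with_generated_bias(x).shape)                # (64, 32)
```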
