Learning Pseudorandom Numbers with Transformers

Original link: https://arxiv.org/abs/2510.26792

This study investigates the surprising ability of Transformer models to learn and predict sequences generated by Permuted Congruential Generators (PCGs), a sophisticated family of pseudorandom number generators. Although PCGs are considerably harder than simpler generators, Transformers still successfully predict unseen PCG sequences, even when restricted to predicting a single output bit. The study demonstrates a scaling law in the modulus m: the number of in-context sequence elements required for near-perfect prediction grows as √m, and very large moduli (m ≥ 2^20) require curriculum learning, i.e., training first on smaller moduli. Notably, when trained on several PCGs simultaneously, the model identifies shared structural patterns across them. Analysis of the model's embedding layer reveals an intriguing clustering phenomenon: inputs are grouped into rotation-invariant clusters, suggesting a mechanism by which learned representations transfer across modulus sizes. This work highlights Transformers' ability to learn complex mathematical structure and offers insight into their internal representations.
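For concreteness (the constant of proportionality is not stated in the abstract), the reported scaling $N(m) \sim \sqrt{m}$ implies that at the largest modulus studied, $m = 2^{22}$, near-perfect prediction would need on the order of $\sqrt{2^{22}} = 2^{11} = 2048$ in-context sequence elements.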


Original

Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability, by Tao Tao and Maissam Barkeshli

Abstract: We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the top principal components spontaneously group the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
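For readers unfamiliar with the construction, here is a minimal sketch in Python of a PCG-style step: a linear congruential update of a hidden state, followed by a bit-wise permutation (xorshift and rotation) and truncation of the output. The word sizes, constants, and the particular permutation below are illustrative assumptions, not the exact PCG variants or parameters studied in the paper.

def rotr(x, r, bits):
    # Rotate x right by r positions within a bits-wide word.
    r %= bits
    mask = (1 << bits) - 1
    return ((x >> r) | (x << (bits - r))) & mask

def pcg_step(state, a=747796405, c=2891336453, state_bits=22, out_bits=8):
    # Hidden-state update: an ordinary linear congruential generator step mod 2^state_bits.
    m = 1 << state_bits
    state = (a * state + c) % m
    # Output permutation: the top bits of the state choose a rotation amount,
    # an xorshift mixes high bits into the low bits, and the result is
    # truncated to out_bits before being rotated.
    rot = state >> (state_bits - 3)
    x = state ^ (state >> (state_bits // 2))
    out = rotr(x & ((1 << out_bits) - 1), rot, out_bits)
    return state, out

# Generate a short sequence of the kind the Transformer predicts in context.
state = 12345
outputs = []
for _ in range(16):
    state, out = pcg_step(state)
    outputs.append(out)
print(outputs)

In the in-context prediction task only the truncated outputs are visible to the model (in the paper, even a single output bit suffices); the full hidden state is never observed.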
From: Tao Tao
[v1] Thu, 30 Oct 2025 17:59:09 UTC (12,235 KB)
[v2] Mon, 16 Feb 2026 23:41:23 UTC (17,937 KB)