Pretraining Language Models via Neural Cellular Automata

原始链接: https://hanseungwook.github.io/blog/nca-pre-pre-training/

## Neural Cellular Automata for Stronger Language Models

Researchers found that pretraining language models on **neural cellular automata (NCA)** data — evolving patterns generated by neural networks — surprisingly outperforms conventional natural-language pretraining, even at a substantially smaller data scale. NCA generates diverse, rule-based sequences that force the model to **infer latent patterns** (in-context learning) rather than rely on the semantic shortcuts present in natural language.

Concretely, 164M NCA tokens consistently beat the same amount of natural language (C4) and other synthetic data across web-text, math, and code tasks, with lower perplexity and faster convergence. Even when C4 is scaled to 1.6B tokens, NCA remains competitive, suggesting that **per-token diversity and rule inference matter more than raw data volume.** Attention layers are the key: they capture transferable computational primitives. NCA complexity can also be tuned — simpler rules for code, more complex ones for math and text — to optimize performance. The work suggests that future foundation models could learn to reason from synthetic data *before* acquiring semantics, potentially reducing bias and improving efficiency.

## Pretraining Language Models with Neural Cellular Automata - Summary

A recent Hacker News thread discusses a new approach to pretraining language models with neural cellular automata (NCAs). The core idea is to move away from relying on the *semantics* of language for initial training and focus instead on its *structure*. The hypothesis is that learning the underlying rules of sequence generation, regardless of meaning, can build strong reasoning capabilities.

Some commenters drew parallels to earlier work on iterated random computation and "pre-pre-training". Others explored connections to biological systems, noting similarities to feedback loops observed in visual-cortex development and nervous-system growth.

A key question was whether intelligence fundamentally requires natural language at all; octopuses and crows, for example, demonstrate intelligence through very different means such as embodied reasoning. The discussion also touched on training models on synthetic data in simulated environments, and on the challenge of integrating this "physical intelligence" with LLMs. The ultimate goal is foundation models that learn to reason from synthetic data, *then* acquire semantics, potentially avoiding inherited biases.

Original Article

Neural cellular automata (NCA) generalize systems like Conway's Game of Life by replacing fixed rules with neural networks. Each randomly-sampled network defines a unique transition rule, producing diverse spatiotemporal dynamics on a grid. When unrolled over long horizons, these dynamics give rise to a rich spectrum of behaviors — from simple patterns that converge to a fixed attractor state to complex structures that emerge gradually over time.
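To make the setup concrete, here is a minimal sketch (not the authors' implementation) of an NCA in this spirit: a randomly sampled two-layer MLP maps each cell's 3×3 neighborhood to its next state, and different seeds yield different latent rules. All function names and sizes are illustrative.

```python
import numpy as np

def random_nca_rule(hidden=8, seed=0):
    """Sample a random 2-layer MLP mapping a cell's 3x3 neighborhood
    (9 values) to its next state. Each seed defines a unique rule."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 1.0, (9, hidden))
    w2 = rng.normal(0, 1.0, (hidden, 1))
    def rule(neigh):                     # neigh: (..., 9)
        h = np.tanh(neigh @ w1)
        return np.tanh(h @ w2)
    return rule

def step(grid, rule):
    """Apply the rule to every cell in parallel (toroidal boundary)."""
    H, W = grid.shape
    shifts = [np.roll(np.roll(grid, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    neigh = np.stack(shifts, axis=-1).reshape(H, W, 9)
    return rule(neigh)[..., 0]

rng = np.random.default_rng(42)
grid = rng.uniform(-1, 1, (16, 16))
rule = random_nca_rule(seed=7)
for _ in range(20):                      # unroll the dynamics
    grid = step(grid, rule)
print(grid.shape)                        # (16, 16)
```

Swapping the seed swaps the transition rule, which is what makes each unrolled trajectory a fresh rule-inference problem.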

Interactive NCA Simulation

[Interactive demo: different rules produce different complexity levels; gzip compression ratio controls complexity (more compressible = simpler dynamics).]
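The gzip-based complexity measure can be sketched in a few lines (an illustrative implementation, assuming trajectories are float arrays in [-1, 1]): quantize the rollout, compress the bytes, and take the compression ratio.

```python
import gzip
import numpy as np

def gzip_complexity(trajectory, levels=16):
    """Quantize an NCA trajectory and return its gzip compression ratio.
    A higher ratio (less compressible) indicates more complex dynamics."""
    q = np.clip((trajectory + 1) / 2 * (levels - 1), 0, levels - 1)
    raw = q.astype(np.uint8).tobytes()
    return len(gzip.compress(raw)) / len(raw)

# A constant grid compresses almost completely; noise barely at all.
flat = np.zeros((20, 16, 16))
noise = np.random.default_rng(0).uniform(-1, 1, (20, 16, 16))
print(gzip_complexity(flat), gzip_complexity(noise))
```

This kind of cheap proxy is what lets the data generator's complexity be dialed up or down per domain.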

These NCA trajectories are tokenized into sequences (using 2×2 patches, similar to vision transformers) and fed to a standard transformer with next-token prediction. The key: since every sequence has a unique latent rule, the model must infer that rule in-context to predict what comes next. This in-context learning ability underpins many of the key reasoning capabilities observed in language models.
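A rough sketch of this patch tokenization (my own illustrative version, not the paper's code): quantize each frame, cut it into non-overlapping 2×2 patches, and encode each patch as one discrete token id.

```python
import numpy as np

def tokenize_trajectory(frames, patch=2, levels=4):
    """Turn an NCA rollout (T, H, W) into a 1-D token sequence by
    quantizing cells and encoding each patch as a base-`levels` integer,
    analogous to ViT patching but with a discrete vocabulary."""
    T, H, W = frames.shape
    q = np.clip(((frames + 1) / 2 * (levels - 1)).round(),
                0, levels - 1).astype(int)
    # split into non-overlapping patch*patch blocks
    p = q.reshape(T, H // patch, patch, W // patch, patch)
    p = p.transpose(0, 1, 3, 2, 4).reshape(T, -1, patch * patch)
    # encode each patch's cells as one base-`levels` token id
    base = levels ** np.arange(patch * patch)
    return (p * base).sum(-1).reshape(-1)

frames = np.random.default_rng(1).uniform(-1, 1, (4, 8, 8))
seq = tokenize_trajectory(frames)
print(seq.shape)   # vocab size = levels ** (patch * patch) = 256
```

The resulting sequence is then ordinary next-token-prediction data for a standard transformer.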

Stage 1: Pre-pre-train on 164M NCA tokens (synthetic dynamics)

Stage 2: Pre-train on natural language: web, math, code (4-13B tokens)

Stage 3: Fine-tune, task-specific: instruction tuning (…)

03 — Results

The surprising payoff

Under matched token budgets (164M tokens each), NCA pre-pre-training consistently outperforms from-scratch training, pre-pre-training on natural language (C4), and pre-pre-training on other synthetic data (Dyck) across web text, math, and code. The gains aren't just better convergence speed, but also better final perplexity.

Final Perplexity by Domain (↓ lower is better), comparing Scratch, C4 (natural language), Dyck (synthetic), and NCA (synthetic):

OpenWebText: NCA −5.7%
OpenWebMath: NCA −5.2%
CodeParrot: NCA −4.2%

[Figure: validation perplexity during training for Scratch, C4, Dyck, and NCA.]

These language modeling gains transfer to real reasoning benchmarks:

Reasoning Benchmark Performance, comparing Scratch, C4, Dyck, and NCA:

GSM8K (math, pass@1 accuracy)
HumanEval (code, pass@1 accuracy)
BigBench-Lite (reasoning, normalized accuracy, pass@2)

Surprisingly, we observe that our non-linguistic NCA data outperforms natural language at equal scale. So we investigate further: what happens if we give C4 ~10× more data? We scale C4 pre-pre-training to 1.6B tokens while keeping NCA at 164M. Even with this data advantage, NCA still converges 1.4× faster and achieves 5% better final perplexity.

OpenWebText: NCA (164M tokens) vs C4 (1.6B tokens)

[Figure legend: Scratch, NCA (164M), Dyck (164M), C4 (1.6B), C4 (1.6B, no reinit).]

164M tokens of automata beats 1.6B tokens of natural language. We believe the difference reflects what each data source teaches at each scale. At 1.6B tokens — far below the compute-optimal scale — C4 largely teaches shallow, local patterns, while each NCA sequence trains the model to infer a latent rule from context (i.e., in-context learning) and apply it consistently. This per-token diversity in functions, rather than redundant linguistic patterns, appears more efficient at building the general-purpose representations that transfer to language.

04 — Key Insights

What drives the transfer?

Attention is the carrier

Re-initialization experiments show attention layers capture the most transferable computational primitives. MLPs encode domain-specific knowledge — transferable only when source and target align.
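A re-initialization probe of this kind can be sketched as follows. This is a toy illustration, not the paper's code: the checkpoint is a hypothetical `{name: array}` dict, and the layer names are invented. The idea is simply to carry over attention weights while re-sampling everything else before language training.

```python
import numpy as np

def selective_reinit(state, keep="attn", seed=0):
    """Keep pre-pre-trained weights whose name contains `keep`
    (here, attention), and freshly re-initialize the rest
    (MLPs, embeddings) before the next training stage."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, w in state.items():
        if keep in name:
            out[name] = w.copy()                      # transfer as-is
        else:
            out[name] = rng.normal(0, 0.02, w.shape)  # fresh init
    return out

# Hypothetical toy checkpoint with attention and MLP weights.
ckpt = {
    "block0.attn.qkv": np.ones((8, 24)),
    "block0.attn.proj": np.ones((8, 8)),
    "block0.mlp.fc1": np.ones((8, 32)),
    "block0.mlp.fc2": np.ones((32, 8)),
}
new = selective_reinit(ckpt)
```

Comparing downstream performance with `keep="attn"` versus `keep="mlp"` is what isolates which component carries the transfer.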


Complexity must match

The optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex ones. This opens a new lever for targeted training.

Structure, not semantics

NCA data has zero linguistic content — yet teaches models to track long-range dependencies and infer latent rules, the same capabilities needed for language.

Efficiency over scale

More synthetic data isn't always better. Calibrating the complexity of the data generator matters more than raw volume, enabling smarter training with less compute.

Optimal Complexity by Domain

[Interactive: which NCA complexity band transfers best for each domain.]

05 — Why It Works

A purer training signal

At small token budgets, natural language pre-training mostly teaches shallow patterns. Models exploit semantic shortcuts and co-occurrence priors rather than learning to reason from structure. On the other hand, NCA sequences do not contain any semantic shortcuts.

Every NCA trajectory is generated by a hidden transition rule — a randomly sampled neural network — that the model must infer purely from context. Since there's no semantic content to fall back on, every token pushes the model toward in-context rule inference: observing a sequence, hypothesizing the underlying rule, and applying it consistently forward. This mirrors one of the core capabilities of language models (i.e., in-context learning).

Because NCA rules are drawn from a universal class of computable functions — some realizing Turing-complete systems — the distribution is too vast to memorize. The model is forced to learn a general mechanism for rule inference rather than memorizing specific rules. This is supported by our empirical findings: attention layers, not the MLPs, carry the most transferable structure. Prior work shows that in-context learning ability emerges with the formation of induction heads — attention circuits that copy and apply patterns from earlier in the sequence. NCA pre-pre-training exclusively rewards this behavior, likely inducing earlier and more robust formation of these circuits before language training begins.

06 — The Big Picture

Beyond one-size-fits-all

This work opens a fundamentally new axis of control for training language models. Instead of treating the training distribution as fixed, we can tune the structure of synthetic data to match target domains: simpler NCA rules for code, richer long-range dynamics for genomic sequence modeling.

The long-term vision: foundation models that acquire reasoning from fully synthetic data, then learn semantics from a small, curated corpus of natural language. This would help us build models that reason without inheriting human biases from inception.

The question is no longer whether synthetic pre-training can work, but how far it can go.

Citation

If you find this work useful, please consider citing our paper:

@inproceedings{placeholder2026nca,
  title={Training Language Models via Neural Cellular Automata},
  author={Seungwook Han and Dan Lee and Akarsh Kumar and Pulkit Agrawal},
  booktitle={TBD},
  year={2026}
}