Neural cellular automata (NCA) generalize systems like Conway's Game of Life by replacing fixed rules with neural networks. Each randomly-sampled network defines a unique transition rule, producing diverse spatiotemporal dynamics on a grid. When unrolled over long horizons, these dynamics give rise to a rich spectrum of behaviors — from simple patterns that converge to a fixed attractor state to complex structures that emerge gradually over time.
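To make this concrete, here is a minimal sketch of an NCA with a randomly sampled neural rule. All names, sizes, and design choices (a tiny tanh MLP over a 3×3 neighborhood, toroidal boundaries) are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def make_rule(hidden=8, rng=None):
    """Sample a tiny random MLP mapping a cell's 3x3 neighborhood to its
    next state. Each sample defines a distinct transition rule."""
    rng = rng or np.random.default_rng()
    w1 = rng.normal(0.0, 1.0, size=(9, hidden))
    w2 = rng.normal(0.0, 1.0, size=(hidden,))
    def rule(patch):                      # patch: flattened 3x3 neighborhood
        return float(np.tanh(np.tanh(patch @ w1) @ w2))
    return rule

def step(grid, rule):
    """Apply the rule at every cell with periodic (toroidal) boundaries."""
    H, W = grid.shape
    out = np.empty_like(grid)
    for i in range(H):
        for j in range(W):
            patch = np.array([grid[(i + di) % H, (j + dj) % W]
                              for di in (-1, 0, 1) for dj in (-1, 0, 1)])
            out[i, j] = rule(patch)
    return out

rng = np.random.default_rng(0)
grid = rng.uniform(-1, 1, size=(16, 16))
rule = make_rule(rng=rng)
traj = [grid]
for _ in range(8):                        # unroll the dynamics over time
    traj.append(step(traj[-1], rule))
```

Resampling `make_rule` yields a new latent dynamical system each time, which is what gives the dataset its per-sequence diversity.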
Interactive NCA Simulation
Watch how different rules produce different levels of complexity. Click to randomize.
gzip compression ratio measures complexity: more compressible = simpler dynamics
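The gzip-based complexity measure above can be sketched in a few lines: quantize a rollout to bytes and take the DEFLATE compressed-size ratio. The helper name and quantization details are assumptions for illustration:

```python
import zlib
import numpy as np

def complexity_score(trajectory, level=9):
    """gzip (DEFLATE) compression ratio of a quantized rollout:
    compressed_size / raw_size. Lower = more compressible = simpler."""
    arr = np.concatenate([np.asarray(g) for g in trajectory])
    scaled = (arr - arr.min()) / (np.ptp(arr) + 1e-8)   # normalize to [0, 1]
    raw = (scaled * 255).astype(np.uint8).tobytes()
    return len(zlib.compress(raw, level)) / len(raw)

constant = [np.zeros((16, 16))] * 8                     # trivially simple
noise = [np.random.default_rng(0).uniform(size=(16, 16)) for _ in range(8)]
```

A fixed-point trajectory compresses almost to nothing, while near-random dynamics stay close to ratio 1, so the score orders rules from simple to complex.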
These NCA trajectories are tokenized into sequences (using 2×2 patches, similar to vision transformers) and fed to a standard transformer with next-token prediction. The key: since every sequence has a unique latent rule, the model must infer that rule in-context to predict what comes next. This in-context learning ability underpins many of the key reasoning capabilities observed in language models.
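A minimal sketch of the 2×2 patch tokenization described above, in the style of vision transformers. The function name is hypothetical, and a real pipeline would additionally quantize the continuous cell states into a discrete vocabulary, which is omitted here:

```python
import numpy as np

def patchify(frame, p=2):
    """Split an H x W grid frame into non-overlapping p x p patches,
    flattened in raster order (ViT-style patch tokens)."""
    H, W = frame.shape
    assert H % p == 0 and W % p == 0
    return (frame.reshape(H // p, p, W // p, p)
                 .transpose(0, 2, 1, 3)      # group each p x p block together
                 .reshape(-1, p * p))

frame = np.arange(16).reshape(4, 4)
tokens = patchify(frame)
# tokens[0] is the top-left 2x2 patch: [0, 1, 4, 5]
```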
Stage 1
Pre-pre-train
164M NCA tokens
Synthetic dynamics
→
Stage 2
Pre-train
Natural language
Web, math, code (4-13B tokens)
→
Stage 3
Fine-tune
Task-specific
Instruction tuning
03 — Results
The surprising payoff
Under matched token budgets (164M tokens each), NCA pre-pre-training consistently outperforms from-scratch training, pre-pre-training on natural language (C4), and pre-pre-training on other synthetic data (Dyck) across web text, math, and code. The gains aren't just better convergence speed, but also better final perplexity.
[Figure: Final perplexity by domain (lower is better) for Scratch, C4 (natural language), Dyck (synthetic), and NCA (synthetic).]
[Figure: Validation perplexity during training for Scratch, C4, Dyck, and NCA.]
These language modeling gains transfer to real reasoning benchmarks:
[Figure: Reasoning benchmark performance for Scratch, C4, Dyck, and NCA.]
Surprisingly, our non-linguistic NCA data outperforms natural language at equal scale. So we investigate further: what happens if we give C4 roughly 10× more data? We scale C4 pre-pre-training to 1.6B tokens while keeping NCA at 164M. Even with this data advantage, NCA still converges 1.4× faster and achieves 5% better final perplexity.
[Figure: OpenWebText validation perplexity for NCA (164M tokens) vs. Scratch, Dyck (164M), C4 (1.6B), and C4 (1.6B, no reinit).]
164M tokens of automata beats 1.6B tokens of natural language. We believe the difference reflects what each data source teaches at each scale. At 1.6B tokens — far below the compute-optimal scale — C4 largely teaches shallow, local patterns, while each NCA sequence trains the model to infer a latent rule from context (i.e., in-context learning) and apply it consistently. This per-token diversity in functions, rather than redundant linguistic patterns, appears more efficient at building the general-purpose representations that transfer to language.
04 — Key Insights
What drives the transfer?
Attention is the carrier
Re-initialization experiments show attention layers capture the most transferable computational primitives. MLPs encode domain-specific knowledge — transferable only when source and target align.
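The idea behind these re-initialization experiments can be sketched as selectively resampling one family of weights between stages. Everything here is illustrative: the parameter names are hypothetical and the paper's exact procedure is not reproduced:

```python
import numpy as np

def reinit_mlps(params, rng, scale=0.02):
    """Resample every MLP weight while keeping attention weights intact,
    mirroring a selective re-initialization between training stages."""
    return {name: (rng.normal(0.0, scale, w.shape) if ".mlp." in name else w)
            for name, w in params.items()}

rng = np.random.default_rng(0)
# Toy parameter dict with made-up names (not the paper's actual model):
params = {
    "block0.attn.qkv": np.ones((8, 24)),
    "block0.mlp.fc1": np.ones((8, 32)),
    "block0.mlp.fc2": np.ones((32, 8)),
}
new_params = reinit_mlps(params, rng)
```

If downstream performance survives re-initializing the MLPs but not the attention, the transferable structure must live in attention, which is the pattern the experiments report.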
Complexity must match
The optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex ones. This opens a new lever for targeted training.
Structure, not semantics
NCA data has zero linguistic content — yet teaches models to track long-range dependencies and infer latent rules, the same capabilities needed for language.
Efficiency over scale
More synthetic data isn't always better. Calibrating the complexity of the data generator matters more than raw volume, enabling smarter training with less compute.
Optimal Complexity by Domain
Click a domain to see which NCA complexity band transfers best
05 — Why It Works
A purer training signal
At small token budgets, natural language pre-training mostly teaches shallow patterns: models exploit semantic shortcuts and co-occurrence priors rather than learning to reason from structure. NCA sequences, by contrast, contain no semantic shortcuts to exploit.
Every NCA trajectory is generated by a hidden transition rule — a randomly sampled neural network — that the model must infer purely from context. Since there's no semantic content to fall back on, every token pushes the model toward in-context rule inference: observing a sequence, hypothesizing the underlying rule, and applying it consistently forward. This mirrors one of the core capabilities of language models (i.e., in-context learning).
Because NCA rules are drawn from a universal class of computable functions — some realizing Turing-complete systems — the distribution is too vast to memorize. The model is forced to learn a general mechanism for rule inference rather than memorizing specific rules. This is supported by our empirical findings: attention layers, not the MLPs, carry the most transferable structure. Prior work shows that in-context learning ability emerges with the formation of induction heads, attention circuits that copy and apply patterns from earlier in the sequence. NCA pre-pre-training exclusively rewards this behavior, likely inducing earlier and more robust formation of these circuits before language training begins.
06 — The Big Picture
Beyond one-size-fits-all
This work opens a fundamentally new axis of control for training language models. Instead of treating the training distribution as fixed, we can tune the structure of synthetic data to match target domains: simpler NCA rules for code, for instance, or richer long-range dynamics for genomic sequence modeling.
The long-term vision is foundation models that acquire reasoning from fully synthetic data and then learn semantics from a small, curated corpus of natural language. This would let us build models that reason without inheriting human biases from the outset.
The question is no longer whether synthetic pre-training can work, but how far it can go.
Citation
If you find this work useful, please consider citing our paper:
@inproceedings{placeholder2026nca,
title={Training Language Models via Neural Cellular Automata},
author={Seungwook Han and Dan Lee and Akarsh Kumar and Pulkit Agrawal},
booktitle={TBD},
year={2026}
}