Human-Like Neural Nets by Catapulting

What all of these anomalies seem to share is a core scaling-law-like relationship among parameters, memorization, generalization, and training: a multi-way bias-variance tradeoff, where different systems hit different points on a Pareto frontier. NNs have low variance in-sample but high bias out-of-sample or on hard problems, and biological brains are at the other extreme (with an unhappy valley of intermediates which impress no one).

I suggest that the core insight here is that too-extensive memorization is the enemy of abstraction, by leading a model to a local optimum which minimizes error but encodes fundamentally the wrong algorithm. Instead, we must, paradoxically, defy intuitions about overfitting by training as large a model as possible in order to handle as little data as is available to a human, without overfitting.

What happens when we train a current DL model is that it is lazy, and so it rapidly homes in on the nearest local minimum of the loss it can find. This minimum tends to be one where it has highly-efficiently memorized the data and learned all the non-robust features and statistical shortcuts; this assemblage of tricks is genuinely effective and intelligent, and is not “cheating” (non-robust features really are present in the heldout data etc.), but it does not generalize far beyond that, because it has not hit on the true underlying algorithm or latent manifold.

Why do grokking NNs seem to need to “memorize to generalize”? Why might it be hard to make progress towards the true target before the memorization phase?

Perhaps because LLMs both memorize and forget datapoints constantly during training (which is why number of copies in the dataset matters: more likely to have seen it recently before the end)—but this forgetting happens for no good reason? “Repetition is the mother of learning” (which is why spaced repetition is good for abstraction, not just brute facts): it is difficult to generalize even the simplest proposition like A + B = C if a NN is constantly forgetting & relearning each part.

Grokking appears to proceed in two phases, first the memorization of all available training data, then the gradual development internally of an increasingly-refined generalizing algorithm.

With that in mind, we might describe the benefit of the memorization as the “learning facts” phase of pedagogy, and then the generalization phase as “the NN thinking about or pondering the facts it has learned until it gets it”. Each minibatch is another ‘thought’ about the data, as the NN struggles to understand the gestalt of the data as more than a bunch of brute facts, and the gradient descent slowly homes in on the generalizing algorithm. (And then memorizing too many facts can sabotage grokking by ‘squeezing out’ the generalizing algorithm because too much needs to be memorized—memorizing a few well-chosen examples is more useful than memorizing countless redundant pieces of trivia.)

That true target may be ‘very distant’ in the loss landscape, and getting there may require an exorbitant amount of data—each data point painfully pushing it ever so slightly out of its comfort zone until one day, it finally is forced by the overwhelming weight of long-tail anomalies to turn into the right model.

The right algorithm will lie in a distant part of the model loss-landscape, but to reach it using a reasonable amount of training data requires the model to travel far (as a kind of grokking/catapult/super-convergence), which is only possible if the model is so overparameterized that it can encode smooth paths (like saddle-points, since all models seemingly are linear-mode-connected) and can ensemble over extremely large families of models which ‘blur out’ to a smooth abstraction of the posterior. And then high-learning-rate training is critical to kick it along the frequent saddle-points & plains that slow down optimization, oscillating between high learning rates to escape the current local optimum and lower learning rates to consolidate the gains and find the new escape route. These models inherently require long serial training with many time-steps, cannot be easily ‘parallelized’ or absorb large amounts of information, and benefit from many long periods of inactivity (‘sleep’) to globally repair the damage from learning or ‘catch up’ on backlogged steps.

Such a model will memorize little of its training data, because that would require rigid, fragile, precise parameters—but those parameters all need to be recycled and explore strange new model loss-landscapes in order to eventually arrive at the promised land. So during training, and even afterwards, such a model will forget and perform badly on benchmarks that reward memorization (such as declarative knowledge)—even though it will avoid adversarial examples (because none of its boundaries are extremely low-parameter linear lines dependent on inhumanly high-frequency texture-biased non-robust features) and will generalize well to hard problems (which by definition make up little of the standard benchmarks) by learning all sub-skills and achieving ‘slingshot generalization’ by mastering all combinations of skills, and resolve many issues with contemporary DL.

Such models will be hard to discover because of the use of early stopping and the general greediness of DL training (and R&D in general), even though the core ideas are well-known and have large associated literatures: like the original DL scaling ideas, the payoff of catapulting simply takes too long to come.

This intelligence is not that hard to evolve, but it usually is not worthwhile. So they fall into an uneasy ecological niche: such robust generalization is not useful for many animals compared to simpler, cheaper, forms of learning, such as association or imitation, or hardwiring directly into genetics. It can only pay back its exorbitant costs if the environment rewards robust generalization appropriately by providing enough high-payoff opportunities which are predictable but not too predictable.

What Is the Largest Useful LLM?

LLMs are more sample-efficient the larger they are. At what parameter scale does this stop holding true? And why don’t we know the answer to this question already?

What would a ‘human-sized LLM’ (or HLLM) look like?

First, it would need to be highly overparameterized relative to the problem as a whole, in order to provide a smoothly-connected loss landscape. Second, because it is so overparameterized, standard hyperparameters will result in overfitting but not catapulting; heavy regularization will be required, and weight decay is probably not enough (as it doesn’t result in much model movement) but a high learning rate schedule may be adequate regularization given prior art like super-convergence & human parallels.

The final full-scale HLLM would probably look something like a dense Transformer or MLP which is multiple orders of magnitude larger than currently trained (so >10-trillion parameters, possibly >100-trillion), in order to be highly-overparameterized compared to the full distribution. (In grokking papers or the isoperimetry estimates, the NNs are generally several OOMs larger than ‘reasonable’, so if we ballpark GPT-4-level models at ~1t, we would weakly expect the catapulting regime to be ~100t.) The NN should probably be a “skinny” one, emphasizing depth over width; a long-standing trend in DL is that ‘wide’ networks tend to memorize more heavily and poorly express more algorithmic/computational reasoning, and that DL NNs tend to look rather un-biological in using so little recurrence/iterative computation or composing reasoning (while RNNs often do better than Transformers in specific cases, generalizing better despite their overall inferiority). Very deep networks tend to be avoided due to overfitting or instability rendering their theoretical advantages moot, but catapulting would potentially fix that, and benefit from the inductive biases. This is consistent with the most recent work on sample-efficient LLMs, like Kim et al 2025 or NanoGPT Slowrun, which emphasize increasing parameter size (eg. via ensembling) and heavy regularization over many epochs.
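As a rough sizing aid for the "skinny" shape described above, the following sketch uses the common rule of thumb that a dense Transformer has about 12 × layers × width² non-embedding parameters (this assumes a feed-forward block 4× the model width). The function names and the example widths are illustrative assumptions, not part of the proposal.

```python
import math

def dense_transformer_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count: ~12 * L * d^2.

    Assumes attention projections (4*d^2) plus a 4x-wide feed-forward
    block (8*d^2) per layer; embeddings and biases are ignored.
    """
    return 12 * n_layers * d_model ** 2

def layers_for_budget(target_params: float, d_model: int) -> int:
    """Depth needed to reach a parameter budget at a fixed width."""
    return math.ceil(target_params / (12 * d_model ** 2))

if __name__ == "__main__":
    budget = 100e12  # the ~100t ballpark from the text
    for width in (8192, 16384, 32768):  # hypothetical widths
        depth = layers_for_budget(budget, width)
        total = dense_transformer_params(depth, width)
        print(f"width={width:6d} depth={depth:7d} params={total:.3e}")
```

Holding the budget fixed, halving the width roughly quadruples the required depth, which is the sense in which a depth-heavy model implies far more serial computation per step.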

The NN is trained on small text corpuses at BabyLM scale (~0.1b words). One benefit here is that because the sample size has to be limited, it can be filtered extremely heavily for quality/deduplication: because the data distribution decides whether a model will meta-learn, the dataset should be as diverse as possible, and penalize memorization as much as possible (see Wang et al 2024). Each minibatch should sample as many distinct datapoints as possible, and likewise diversity should be maximized across batches, so each catapulting step is catapulting for as many ‘skills’ or ‘capabilities’ simultaneously as possible. This will delay their learning, because only the bare minimum of data is available per skill, so there is no overkill, and they will tend to be learned simultaneously, which leads to ‘emergence’ as multi-step processes suddenly start becoming possible. (Because humans can replay memories, both in short-term and long-term memory and through the hippocampus, it would be reasonable to do multiple epochs if there is not enough high-quality diverse data for single-epoch training.)
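One way to picture the "maximize diversity per minibatch" requirement is a sampler that interleaves examples across skills. The sketch below assumes each example already carries a skill tag, which is hypothetical: a real corpus has no such labels, and something like cluster IDs would have to stand in for them.

```python
import random
from collections import defaultdict

def diverse_batches(examples, batch_size, seed=0):
    """Group (skill_tag, text) pairs into batches that mix skills.

    Greedy round-robin: each round takes one unused example from every
    skill that still has data. When batch_size divides the number of
    active skills, every batch contains only distinct skills; otherwise
    a batch that straddles two rounds may repeat a skill.
    """
    rng = random.Random(seed)
    by_skill = defaultdict(list)
    for skill, text in examples:
        by_skill[skill].append((skill, text))
    for pool in by_skill.values():
        rng.shuffle(pool)

    order = []
    pools = list(by_skill.values())
    while pools:
        rng.shuffle(pools)               # vary skill order between rounds
        for pool in pools:
            order.append(pool.pop())     # each example is used exactly once
        pools = [p for p in pools if p]  # drop exhausted skills

    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

if __name__ == "__main__":
    toy = [(s, f"{s}-{i}") for s in ("add", "sub", "mul", "div") for i in range(3)]
    for batch in diverse_batches(toy, batch_size=4):
        print([skill for skill, _ in batch])
```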

The catapulting itself is due to a cyclical learning rate schedule like super-convergence, perhaps combined with heavy weight decay.
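For concreteness, here is a minimal sketch of one simple cyclical schedule, the triangular form in which the learning rate ramps linearly between a floor and a peak. The specific values are placeholders, and the one-cycle policy usually associated with super-convergence differs in detail; this only illustrates the oscillation between "escape" and "consolidate" phases.

```python
def triangular_lr(step: int, base_lr: float, max_lr: float, half_cycle: int) -> float:
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over `half_cycle`
    steps, falls back to base_lr over the next `half_cycle` steps,
    and then repeats.
    """
    cycle_pos = step % (2 * half_cycle)
    # Normalized distance from the peak: 1 at the cycle ends, 0 at the peak.
    distance = abs(cycle_pos / half_cycle - 1.0)
    return base_lr + (max_lr - base_lr) * (1.0 - distance)

if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the schedule.
    for step in range(0, 401, 50):
        print(step, round(triangular_lr(step, 1e-4, 1e-2, half_cycle=100), 6))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step; weight decay would be configured separately.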

So, what would happen when training our oversized LLM on our highly-diverse memorization/repetition-purified data with cyclical schedules for heavy regularization? We would observe the classic cyclical training loss behaviors of spikes followed by reaching new lows, but with stagnant performance on the truly held-out data, as the LLM goes through the memorization phase, and eventually reaching a new regime where it begins to transition from memorization to generalization over many tasks simultaneously, which will then suddenly ‘emerge’. Each cycle builds up a new set of atomic skills, dependent on the skills learned in previous cycles (analogous to developmental phases).

Prototyping With Arithmetic

It might be feasible to test LLM catapulting on small-scale tasks where current LLMs clearly generalize poorly, like arithmetic. Arithmetic is about the smallest, simplest, easiest-to-generate problem that LLMs still fail in oddly brittle ways on, so it’s a great testbed.

Arithmetic is learnable with appropriate formatting by small cheap LLMs, but standard LLMs (trained on natural arithmetic text data) continue to not implement true arithmetic, even at the capability frontier like GPT-4, and arithmetic problems are easy to generate, benchmark, understand, and even do neural net interpretability on; so one could pilot catapulting on a pretrained LLM by looking for training schedules which make it find true arithmetic much faster than standard finetuning does.

More specifically, one would filter for ‘hard’ arithmetic problems and then search for catapult training recipes which reduce the exponent in the scaling law compared to the ‘standard’ training recipe. If one used regular arithmetic problems, the gain on rare hard problems—the sort which expose the fact that the LLM has only learned a collection of partial heuristics, approximations, and memorized answers—would be hopelessly masked by the average case. (It is entirely possible that before perfect arithmetic performance generalization, a catapulted LLM, which has mostly succeeded in learning true arithmetic, would be outperformed on average by the regular LLM which has memorized as much as possible.)

So one would do something like filtering stringently for the 0.1% (since arithmetic is so big) of the hardest arithmetic problems (as evaluated by an existing LLM or by testing for generalization past n digits), and then use that as the heldout data that one runs scaling law sweeps on for all training recipes.
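A minimal sketch of the filtering step, assuming a cheap stand-in for difficulty: the number of carry operations in an addition problem. That proxy is an assumption made here for illustration; the text suggests scoring with an existing LLM or testing generalization past n digits, either of which could replace `count_carries`.

```python
import random

def count_carries(a: int, b: int) -> int:
    """Number of carry operations when adding a and b digit by digit."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries

def make_problems(n: int, max_digits: int, seed: int = 0):
    """Generate n random addition problems with operands below 10**max_digits."""
    rng = random.Random(seed)
    hi = 10 ** max_digits
    return [(rng.randrange(hi), rng.randrange(hi)) for _ in range(n)]

def hardest_fraction(problems, fraction: float):
    """Keep the top `fraction` of problems by the hardness proxy (at least one)."""
    ranked = sorted(problems, key=lambda p: count_carries(*p), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

if __name__ == "__main__":
    pool = make_problems(10_000, max_digits=8)
    heldout_hard = hardest_fraction(pool, 0.001)  # the 0.1% from the text
    print(len(heldout_hard), heldout_hard[:3])
```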

The scaling laws would ignore the average-case performance from the training runs, and also the constant factor on the hard data, and look for changes in the exponent of the scaling laws for the hard data.
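To make "look at the exponent, ignore the constant" concrete, the sketch below fits a pure power law, error = a · n^(−b), by least squares in log-log space and reports b separately from a. The pure power-law form (no irreducible-error term) is a simplifying assumption, and the numbers in the demo are synthetic values generated from known power laws, not measurements.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y).

    Returns (a, b); b is the scaling exponent, a the constant factor.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

if __name__ == "__main__":
    budgets = [1e3, 1e4, 1e5, 1e6]
    # Synthetic hard-set error curves for two hypothetical recipes.
    standard = [1.0 * n ** -0.10 for n in budgets]
    catapult = [3.0 * n ** -0.25 for n in budgets]
    for name, errs in (("standard", standard), ("catapult", catapult)):
        a, b = fit_power_law(budgets, errs)
        print(f"{name}: constant={a:.2f} exponent={b:.3f}")
```

A recipe would be judged by whether its fitted exponent on the hard heldout set is larger, regardless of how its constant compares.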

Ideally, one would find something like a training recipe where after many epochs, the catapulted small LLMs are improving more rapidly on the hard data than the standard LLMs are, and that even if the catapulted LLMs are substantially worse everywhere, the more rapid improvement means that at some point “the curves cross”, and the catapulted LLMs are superior. This would then be proof of concept that catapulting is not merely possible on a more complex problem like arithmetic that continues to challenge even SOTA LLMs, but that it changes the scaling laws.
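The point where "the curves cross" can be written down explicitly under the simplifying assumption that both recipes follow pure power laws in the training budget n (the symbols below are introduced here for illustration). Suppose the standard recipe has hard-set error a_s · n^(−b_s) and the catapulted recipe has a_c · n^(−b_c), with a worse constant (a_c > a_s) but a better exponent (b_c > b_s):

```latex
\begin{aligned}
a_c\, n^{-b_c} &= a_s\, n^{-b_s} \\
\frac{a_c}{a_s} &= n^{\,b_c - b_s} \\
n^{*} &= \left(\frac{a_c}{a_s}\right)^{\frac{1}{b_c - b_s}}
\end{aligned}
```

For any n > n*, the catapulted recipe has the lower error. Because the exponent difference sits in the denominator of the power, a small gain in exponent combined with a large constant-factor penalty can push n* out by many orders of magnitude, which is one way such a recipe could look like a failure in short runs.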

Then to verify this result, one could apply interpretability research to it (like Zhong et al 2023): the final catapulted LLM should clearly express a valid arithmetic algorithm where the standard LLM fails to, and there should be phase transitions across the catapulted LLM’s checkpoints from standard LLM-like pseudo-arithmetics to true arithmetic.

With this proof of concept in hand, one can work on further optimizing the catapult training, and start attempting to infer what catapult training methods might scale up to SOTA LLMs like a GPT-4.

MLPs

Why might MLPs be especially suited here? Because extreme regularization may help fix their persistent overfitting problems and provide superior scaling.

Use of sparsity like mixture-of-experts would tend to reduce the effective parameter-size and the connectivity of the NN landscape, and so would be somewhat risky. However an intriguing possibility is that catapulting might make fully-connected MLP architectures viable.

MLP architectures are much simpler, more general, parameter-efficient, more hardware-friendly than CNNs/Transformers, and look like the logical next candidate for Bitter-Lesson-ing the status quo NN architecture—but MLPs still fall far behind. Why?

The main reason seems to be that they are too powerful, and overfit. (They are to Transformers/CNNs as those architectures are to humans, one might say.) Zhao et al 2021 & Bachmann et al 2023 demonstrate that MLPs scale well and can be competitive if enough regularization (like bottleneck layers) is added. In particular, the more regularization (and consequently, generalization downstream), the more they learn sensible convolution-like features, rather than extremely high-frequency & non-robust features.

However, it is still unclear what “regularization” would preserve all the MLP benefits without crippling the architecture—and catapulting fits the bill! The high LR & catapult trajectory would suppress those MLP pathologies the same way it suppresses the milder versions in Transformers. (See Liu et al 2022, and Fan et al 2024 on how narrow deep MLPs seem to grok in unusual & better ways; likewise, He et al 2024 on deeper Transformers.)

Hardware

A serial catapulting regime would render extremely large GPU-clusters much less useful, as they will not be able to step through each iteration much faster; it would instead place a high premium on more exotic hardware like Cerebras chips, which can execute a training step in a small fraction of the time, and hence, wallclock.

Incidentally, the use of low-latency hardware would also open up more exotic neural net architectures like AUNN/IFNN.

Prototyping With Image Classification

For non-LLMs like CNNs/MLPs trained on CIFAR-10/CIFAR-100 or ImageNet-1k (MNIST being too trivial to be convincing), we would similarly expect more human-like behavior on images. In image classification, it’s harder to define ‘hard’ than in arithmetic, and the standard NN accuracy is so high that simply filtering for errors might not yield enough ‘reasonable’ errors at this point.

So we would instead use one of the many ‘post-ImageNet’ image datasets designed to stress-test classifiers, like ImageNet-A, ImageNet-C, ImageNet-Hard, ImageNet-R, & ImageNet-Sketch. These would be used as the target in scaling law sweeps as described previously.

Adversarial Robustness

A possible alternative would be to look at adversarial robustness instead of a standard benchmark.

If the dimpled manifold thesis is correct, then I interpret it as predicting that: while HLLMs might be intrinsically robust per isoperimetry, so too might be tiny models, as long as they found the right manifold which did not require the “dimples” (ie. unprincipled, ad hoc, memorized tweaks to the decision boundaries to get the right answer). This true manifold would potentially be found by catapulting.

So if one successfully catapulted a small CNN, even on CIFAR-10, it might demonstrate adversarial robustness and generalization (rather than the usual small NN choice of either robustness or generalization). This might be especially the case with MLPs, and while an additional research risk, the efficiency of MLPs would allow extensive testing on just 1 GPU (eg. Bachmann et al 2023).

No theory of adversarial examples other than “non-robust features” & “dimpled manifold” predicts that small models might be adversarially robust if simply trained for a long time with an odd learning rate schedule, so any large improvement in adversarial robustness is an important finding.
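As a toy illustration of what "measuring adversarial robustness" means operationally, the sketch below computes accuracy under a one-step FGSM attack. It uses a hand-set logistic-regression classifier as a stand-in, which is an assumption made to keep the example dependency-free; a real test would attack the trained CNN or MLP, typically with stronger multi-step attacks.

```python
import math

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    return 1 if logit(x, w, b) > 0 else 0

def fgsm(x, y, w, b, eps):
    """One-step FGSM: move each input coordinate by eps in the direction
    that increases the cross-entropy loss of the logistic model."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def robust_accuracy(data, w, b, eps):
    """Fraction of (x, y) pairs still classified correctly after the attack."""
    hits = sum(predict(fgsm(x, y, w, b, eps), w, b) == y for x, y in data)
    return hits / len(data)

if __name__ == "__main__":
    w, b = [1.0, -1.0], 0.0  # hypothetical fixed classifier
    data = [([0.3, 0.0], 1), ([0.0, 0.3], 0), ([2.0, 0.0], 1)]
    for eps in (0.0, 0.2):
        print(eps, robust_accuracy(data, w, b, eps))
```

Points near the decision boundary flip under a small perturbation while the far point survives; the experiment proposed above asks whether catapulting moves more of the data away from such fragile boundaries.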

Prior Art

There is essentially no research on training >10t-parameter LLMs, or cyclical LRs on large LLMs (as opposed to small GPT-2-scale ones).

Historically, this is due to a mix of academic fashion/prejudices and capability gains by smaller models. Work on training highly-overparameterized LLMs, or on the equiparameterization regime, was largely killed by the release of the Chinchilla paper, which provided the perfect excuse for everyone to immediately halt parameter-scaling (since it no longer led immediately to SOTA results), as they have always wanted to, for a mix of good and bad reasons. Similarly, the excuse of “optimizing for inference-optimality” by overtraining small models has become popular, by optimizing scaling laws which assume that the trained model will be naively deployed as-is, without the actual pruning, quantizing, & distilling everyone does anyway.

This means that the field of high-energy DL is wide-open: this proposal will be highly unpopular, it is much less likely than usual that anyone would independently investigate this direction, and they will be discouraged by poor preliminary results when the training runs appear to have simply failed (because of subtle bugs, poor hyperparameters, or simply inadequate training time to catapult into a region where better results can be benchmarked—assuming the right benchmarks are being used to begin with).

Benchmarking

This is such a different training regime that previous scaling law sweeps are inapplicable. Further, the goal here may be one that existing benchmarks actively mislead on. They mostly test common, easy questions, the sort on which ‘direct fit’-like thinking does best, by definition.

So a major question here would be whether scaling laws should target perplexity as usual, or if they need to target a custom benchmark which tries to test human-like generalization rather than memorization.

Given the difficulty we have in constructing non-trivia-heavy benchmarks which existing LLMs can’t beat, this might be one of the hardest parts! But I suspect that after training a reasonable HLLM, possibly via trial-and-error, and interacting with it for a while to get an idea of how it acts qualitatively, the right metric might become more obvious.

The benchmarks might include adversarial robustness, hard negative mining (ie. the hardest problems that the best LLMs still get wrong), meta-learning, checking how much sets of models add to an ensemble’s accuracy, or use a metric which rewards performance while penalizing memorization of training data.

Capabilities

The final model should generalize much better—possibly achieving the Nyquist learner limit of perfectly modeling the true (non-dimpled?) latent manifold, and thereby constructively answering Rosenfeld’s question about how a band-limited Nyquist learner could be implemented in current NN architectures.

Per the lottery ticket hypothesis, once the true generalizing algorithm has been found, and has been further trained as desired (perhaps on some large trivia-heavy corpus), we can prune it down to a much smaller, faster, more feasible-to-use model, in an example of “train large, then compress”.
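The simplest form of the compression step is one-shot magnitude pruning, sketched below on a flat list of weights. This is an illustrative stand-in: lottery-ticket experiments typically prune iteratively with retraining between rounds, and nothing here is specific to how a catapulted model would actually be compressed.

```python
def magnitude_prune(weights, keep_fraction: float):
    """Zero out all but the largest-magnitude weights.

    Keeps roughly `keep_fraction` of the weights (at least one); ties at
    the threshold are all kept, so slightly more may survive.
    """
    k = max(1, round(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

if __name__ == "__main__":
    layer = [0.5, -0.01, 2.0, 0.03, -1.5, 0.2]  # toy weights
    print(magnitude_prune(layer, keep_fraction=0.5))
```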

Given the capacity of even small models, it may be possible to finetune these smaller models on arbitrarily large amounts of data to exploit the scaling laws of transfer. Because they start with the human-like generalizing prior, they will learn & memorize the data as appropriate, but without the pathologies of ‘direct fit’ to the data as when trained from scratch on the same data—thereby achieving the best of both worlds.

Economic Implications

If catapulting LLMs is enough to close the gaps with humans and solve AGI, then their economics simply turn into a discussion of AGI; but what if they are much better, yet still far from AGI?

As of 2024, the economics of the largest SOTA LLMs are poor, because model capabilities can be so easily cloned by using the cheap APIs to create large training corpuses, enabling behavior-cloning of the superior models. This makes it possible to ‘clone’ the most expensive multi-billion-dollar LLM into a cheap or even FLOSS LLM (justified by commoditize-your-complement dynamics), eroding margins within months of release, even if the cloned model is broadly inferior. AI scaling companies are trapped in a race to spend the most capital on ever-larger training runs & deploy dirt-cheap distilled versions to gain market-share before the cloners erode their edge, in the hope that they can achieve some network effect or ‘escape velocity’ (like creating AGI).

For a catapulted LLM, however, this would still be the case, but much more so. For catch-up players, catapulting eliminates data-constraints, but the compute-cost (and difficulty) of training a catapulted LLM may be insuperable: the training run itself might not be too expensive, but the trial-and-error and proprietary trade-secret knowledge of the special sauce of how exactly to make catapulting work may be extremely compute-expensive.

And they can continue to do the cloning as usual, but whereas in 2024, the cloned models are not that much inferior and share all the same weaknesses as the SOTA model they are cloning (suffering from adversarial examples, confabulations, bizarre simple errors and brittleness etc.), in a catapult scaling regime, this would not be the case: there would be a clear qualitative difference. In fact, the cloned models may even be superior on many benchmarks, because they are trained so heavily the ‘standard’ low-learning-rate way and enjoy all the benefits of that approach, but still disfavored by users, who can’t afford their unreliability, confabulations, and blandness.

However, the catapulters remain able to do all the usual tricks and can create their own superior ‘clone’ models. (And these clone models can themselves presumably be catapulted for regimes or tasks where there is insufficient data to train the low-learning-rate direct-fit way.) Given the economics of scaling DL, where the compute-cost can be dropped by extremely large amounts while amortizing the initial training cost over many users who provide high usage of GPUs, this further means the first-mover in catapulting potentially can drop prices enough to discourage competition.

So a catapult LLM creator, if the catapulted LLM has enough of a human-like edge in reliability & quality, may be able to maintain high margins much longer than 2024 LLMs do.

Alignment

Speculating further, on the premise of the above: most capability improvements do not help AI alignment as much. This is because they either are specific to a capability in a narrow domain, or they enhance broad capabilities but only in a ‘brittle’ non-generalizing way, which can be highly economically valuable but doesn’t help ‘alignment’ much, because we are interested in alignment not for its current narrow in-domain behavior but for how it generalizes. We need the kind of alignment which doesn’t help a chatbot say socially-acceptable things today (or do the right things for the wrong reasons), but which would make chatbots in charge of civilization do socially-acceptable things in the future (and do the right things, for the right reasons, so they will keep on doing the right things indefinitely).

But this catapult or Nyquist learner may be the exception, because what it helps with is true generalization. A catapulted LLM trained on large amounts of morality-related text has not learned purely an assemblage of memorized fragments, heuristics, tricks, and statistical associations, with any underlying algorithm begrudgingly forced by scaling, or learned deception & situated reasoning to maximize a reward, or been adversarially selected by use of ‘interpretability’ techniques to learn to think in nonlinear opaque ways which do not raise any red flags; instead, it has learned the underlying value-manifold the hard way, like a highly-intelligent, grown, moral adult human has.

Even if the capability improvement turns out to be beaten by standard LLM scaling approaches (like simply brute-force annotating every error), true generalization would be invaluable to alignment.

It does not solve the alignment problem in generality—but it might provide a way to create a genuinely moral AI, and that is a good starting point.

Interpretability

Further, because the native neural net way of thinking is a large complicated pastiche of memorization & heuristics, while the overparameterized grokked LLM has distilled out an algorithmic core for the key tasks, such a catapulted LLM ought to be much more useful for interpretability work. We can more easily validate that they are genuinely moral AI.

Once the overparameterization has been removed, what is left should be natively much closer to simple, verifiable, interpretable, and extractable algorithms. These can then be formally analyzed & verified (possibly with the assistance of the genuinely-moral AI LLMs, which are still risky but not too risky if they are probably moral to begin with, and we confine them to carefully-checked formal outputs and tasks like algorithmic equivalence of a sparsified neural network and extracted sub-algorithms).
