14th of December, 2025
The Chomsky hierarchy is a way to classify text-generating systems (formally, the languages they generate) by how expressive they are. Since generative pretrained transformers, GPTs, are getting quite a bit of attention these days, one might wonder where in the hierarchy they fall. To give a classic example of one category in the hierarchy, context-free languages are those whose strings are produced by iteratively applying rules that locally expand grammar symbols into words. This covers practically all programming languages, because their programs conform to a syntax tree in which each branch is an application of such a production rule. A subcategory of the context-free languages is the n-gram languages, where the possible continuations of a string depend only on a bounded number of words that came immediately before (whereas the derivation of a context-free language can nest unboundedly deep).
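To make the contrast concrete, here is a minimal sketch (the tiny grammar and bigram table are invented toy examples, not anything from a real parser or language model): the context-free generator expands grammar symbols recursively and can therefore nest arbitrarily deep, while the bigram generator only ever looks at the single previous word.

```python
import random

# Toy context-free grammar: NP can embed a whole new sentence S inside itself,
# so derivations can nest arbitrarily deep.
CFG = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "cat"], ["the", "cat", "that", "S", "likes"]],
    "VP": [["sleeps"], ["sees", "NP"]],
}

def generate_cfg(symbol="S"):
    """Recursively expand nonterminals; plain words are returned as-is."""
    if symbol not in CFG:
        return [symbol]
    production = random.choice(CFG[symbol])
    return [word for part in production for word in generate_cfg(part)]

# Toy bigram (n = 2) model: the next word depends only on the previous word.
BIGRAMS = {
    "the":    ["cat"],
    "cat":    ["sleeps", "sees"],
    "sees":   ["the"],
    "sleeps": ["<eos>"],
}

def generate_bigram(start="the", max_len=10):
    words = [start]
    while len(words) < max_len:
        nxt = random.choice(BIGRAMS[words[-1]])
        if nxt == "<eos>":
            break
        words.append(nxt)
    return words

print(" ".join(generate_cfg()))     # e.g. "the cat that the cat sleeps likes sees the cat"
print(" ".join(generate_bigram()))  # e.g. "the cat sees the cat sleeps"
```

The grammar closes every "that … likes" clause by construction; the bigram model cannot guarantee this, because by the time "likes" should appear, the opening "that" has long fallen out of its one-word memory.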
Human language is of course much more complicated than these simple classes of languages. One argument for this could be that humans are generally intelligent and can therefore simulate a Turing machine, which at least in principle gives us the power to understand any computable language. This is a strictly more expressive category than any of the languages we talked about above. A more prosaic argument would point at concrete kinds of sentences that provably cannot be produced by, say, context-free grammars, which I will leave to linguistics textbooks.
Surely then, a system like GPT, which understands human language so well that it can write perfect sonnets and win math competitions, must sit on the same rung of the Chomsky hierarchy as humans, by being Turing complete?
First, we could be cheeky and say that, because GPT has a finite context window with finitely many possible tokens at each position, it can trivially only model n-gram languages. This is true, but not really in the spirit of the question. By the same logic, humans would be finite-state machines, because the molecules of our brains, confined to a finite volume, can only occupy finitely many combinations of Planck volumes. Let's grant our GPT an infinite context window.
Conversely, some assumptions are too generous to describe real-world GPTs. For example, Pérez, Barceló, and Marinkovic (2021) prove that transformers are in fact Turing complete, but only if we grant them unbounded-precision arithmetic and let the decoder run for as many intermediate steps as it needs.
There is a simple argument showing that GPTs, even with an infinite context window, cannot be Turing complete. Since the vocabulary is finite, there are only finitely many rows in the embedding matrix, so each input embedding lives in a compact subset of R^n. By Tychonoff's theorem, the product of the compact spaces the input embeddings live in is also compact. Because the transformer architecture applies a bounded number of continuous operations, the output probabilities are continuous functions on this compact set and therefore attain their bounds. In particular, the probability of emitting the end-of-sequence token (the special token that halts the completion) is bounded from below by some constant ε > 0, so the probability of still running after t steps is at most (1 - ε)^t, and the model stops in finite time with probability 1.
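A quick numerical sketch of that last step, under the single assumption that the end-of-sequence probability is at least ε at every position (the ε = 0.01 below is an arbitrary made-up value, not a property of any real model): the generation length is then dominated by a geometric random variable, so the chance of still running after t steps decays exponentially.

```python
import random

EPS = 0.01          # assumed lower bound on P(end-of-sequence) at every step
TRIALS = 100_000
HORIZON = 2_000     # steps after which we pessimistically declare "still running"

still_running = 0
lengths = []
for _ in range(TRIALS):
    t = 0
    while t < HORIZON:
        t += 1
        # Worst case for halting: EOS is emitted with exactly probability EPS.
        if random.random() < EPS:
            break
    else:
        still_running += 1
    lengths.append(t)

print(f"mean length       : {sum(lengths) / TRIALS:.1f}   (theory: 1/eps = {1 / EPS:.0f})")
print(f"P(run > {HORIZON} steps): {still_running / TRIALS:.5f}   "
      f"(theory: (1 - eps)^{HORIZON} = {(1 - EPS) ** HORIZON:.2e})")
```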
Some readers will have noticed that the above argument describes a slightly different GPT from the ones actually deployed, but the intuition does apply to transformers in general. While other architectures like RNNs can, at least in principle, perform an unbounded number of sequential computation steps during inference, transformers cannot, even in theory. In practice, transformers rely on strange crutches like thinking tokens (e.g. DeepSeek-R1 likes to output "But wait, …") to do what might be called approximating Turing completeness. Similar but more involved arguments prove that commonly used GPT architectures without chain-of-thought cannot learn a generalizable algorithm for even the simplest problems, such as telling an odd number of symbols ("yyy") apart from an even one ("yyyy") (Hahn 2020).
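To make the parity example concrete, here is a minimal sketch of the task Hahn's result concerns, and of how emitting intermediate thinking tokens sidesteps the bounded-depth problem (the scratchpad format below is something I made up for illustration, not DeepSeek-R1's actual output):

```python
def parity_direct(s: str) -> str:
    """Ground truth: 'even' or 'odd' number of 'y' symbols.

    Computing this exactly requires combining information from every
    position, which a fixed-depth transformer must squeeze into a bounded
    number of aggregation steps -- the crux of Hahn's (2020) limitation.
    """
    return "even" if s.count("y") % 2 == 0 else "odd"

def parity_with_scratchpad(s: str) -> list[str]:
    """Chain-of-thought style: emit one 'thinking' token per input symbol.

    Each emitted token depends only on the previous token and one input
    symbol, i.e. a bounded amount of new work per step -- exactly the kind
    of computation a fixed-depth model can do, if it is allowed to keep
    generating. (The token format here is invented for illustration.)
    """
    state = "even"
    trace = []
    for _ in s:
        state = "odd" if state == "even" else "even"
        trace.append(f"so far: {state}")
    trace.append(f"answer: {state}")
    return trace

print(parity_direct("yyy"))                 # odd
print(parity_with_scratchpad("yyy")[-1])    # answer: odd
```

The point is that the unbounded part of the work gets moved into the length of the generated sequence rather than the depth of the network.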
But please do not get any ideas about going back to recurrent networks. This limitation of transformers is precisely the reason why GPTs are so easy to train: the lesser expressiveness of the transformer is a feature, not a bug. The fact that the length of the computational path between the input tokens and the output probabilities is bounded is what allows training to be parallelized across the whole sequence (see also Merrill and Sabharwal 2023).
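Here is a minimal numpy sketch of that contrast (the shapes and the toy single-layer attention are illustrative, not any production implementation): with a causal mask, the outputs at all positions are computed in one batched pass, whereas a recurrent model has to thread its hidden state through the sequence one position at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                       # sequence length, hidden size
x = rng.normal(size=(T, d))       # toy input embeddings for one sequence

# Transformer-style: one causal self-attention pass covers all T positions at once,
# and the compute path from any input to any output has a fixed, bounded depth.
scores = x @ x.T / np.sqrt(d)                                       # (T, T) scores
scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -np.inf)   # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out_parallel = weights @ x        # every position's output in one matmul

# RNN-style: the hidden state is threaded through an unavoidable sequential loop,
# so the path from the first input to the last output grows with T.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
out_sequential = []
for t in range(T):
    h = np.tanh(x[t] + h @ W)     # step t depends on step t - 1
    out_sequential.append(h)

print(out_parallel.shape, np.stack(out_sequential).shape)   # (6, 4) (6, 4)
```

During training with teacher forcing, the one-pass property means the loss at every position of every training sequence can be computed simultaneously on a GPU; the nonlinear recurrent loop cannot be parallelized across time in the same way.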
I claim this awkwardness is central to the disagreement between those who believe GPTs are sufficient for automating most of current human labor and those who believe that only a fundamentally different architecture will get us there. Maybe the Chomsky hierarchy is just a bad way to think about computation, GPTs are fine, and we humans also only ever do fixed-depth approximations to Turing completeness. But maybe unbounded computation path lengths are needed to learn algorithms that generalize the way we can. It could be that this very parallelism is the peril that trumpets the end of the scaling paradigm.