14th of December, 2025
The Chomsky hierarchy is a way to classify text-generating systems (formally, the languages they generate) by how expressive they are. Since generative pretrained transformers, GPTs, are getting quite a bit of attention these days, one might wonder where in the hierarchy they fall. To give a classic example of one category in the hierarchy, context-free languages are those whose strings are produced by iteratively applying rules that locally expand grammar symbols into words. This covers practically all programming languages, because their programs conform to a syntax tree in which each branch is an application of such a production rule. A subcategory of the context-free languages is the n-gram languages, where the possible continuations of a string depend only on a bounded number of words that came immediately before (whereas the derivation of a context-free language can nest unboundedly deep).
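To make the contrast concrete, here is a minimal sketch (the tiny grammar and bigram table are invented toy examples, not anything from a real parser or language model): the context-free generator expands grammar symbols recursively and can therefore nest arbitrarily deep, while the bigram generator only ever looks at the single previous word.

```python
import random

# Toy context-free grammar: NP can embed a whole new sentence S inside itself,
# so derivations can nest arbitrarily deep.
CFG = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "cat"], ["the", "cat", "that", "S", "likes"]],
    "VP": [["sleeps"], ["sees", "NP"]],
}

def generate_cfg(symbol="S"):
    """Recursively expand nonterminals; plain words are returned as-is."""
    if symbol not in CFG:
        return [symbol]
    production = random.choice(CFG[symbol])
    return [word for part in production for word in generate_cfg(part)]

# Toy bigram (n = 2) model: the next word depends only on the previous word.
BIGRAMS = {
    "the":    ["cat"],
    "cat":    ["sleeps", "sees"],
    "sees":   ["the"],
    "sleeps": ["<eos>"],
}

def generate_bigram(start="the", max_len=10):
    words = [start]
    while len(words) < max_len:
        nxt = random.choice(BIGRAMS[words[-1]])
        if nxt == "<eos>":
            break
        words.append(nxt)
    return words

print(" ".join(generate_cfg()))     # e.g. "the cat that the cat sleeps likes sees the cat"
print(" ".join(generate_bigram()))  # e.g. "the cat sees the cat sleeps"
```

The grammar closes every "that … likes" clause by construction; the bigram model cannot guarantee this, because by the time "likes" should appear, the opening "that" has long fallen out of its one-word memory.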
Human language is of course much more complicated than these simple classes of languages. One argument for this could be that humans are generally intelligent and can therefore simulate a Turing machine, which at least in principle gives us the power to understand any computable language. This is a strictly more expressive category than any of the languages we talked about above. A more prosaic argument would point at concrete kinds of sentences that provably cannot be produced by, say, context-free grammars, which I will leave to linguistics textbooks.
Surely then, a system like GPT, which understands human language so well that it can write perfect sonnets and win math competitions, must sit on the same rung of the Chomsky hierarchy as humans, by being Turing complete?
First, we could be cheeky and say that, because GPT has a finite context window with finitely many possible tokens at each position, it can trivially only model n-gram languages. This is true, but not really in the spirit of the question. By the same logic, humans would be finite-state machines, because the molecules of our brains, confined to a finite volume, can only occupy finitely many combinations of Planck volumes. Let's grant our GPT an infinite context window.
Conversely, some assumptions are too generous to describe real-world GPTs. For example, Pérez, Barceló, and Marinkovic (2021) prove that transformers are in fact Turing complete, but only if we grant them unbounded-precision arithmetic and let the decoder run for as many intermediate steps as it needs.
There is a simple argument showing that GPTs, even with an infinite context window, cannot be Turing complete. Since the vocabulary is finite, there are only finitely many rows in the embedding matrix, so each input embedding lives in a compact subset of R^n. By Tychonoff's theorem, the product of the compact spaces the input embeddings live in is also compact. Because the transformer architecture applies a bounded number of continuous operations, the output probabilities are continuous functions on this compact set and therefore attain their bounds. In particular, the probability of emitting the end-of-sequence token (the special token that halts the completion) is bounded from below by some constant ε > 0, so the probability of still running after t steps is at most (1 - ε)^t, and the model stops in finite time with probability 1.
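A quick numerical sketch of that last step, under the single assumption that the end-of-sequence probability is at least ε at every position (the ε = 0.01 below is an arbitrary made-up value, not a property of any real model): the generation length is then dominated by a geometric random variable, so the chance of still running after t steps decays exponentially.

```python
import random

EPS = 0.01          # assumed lower bound on P(end-of-sequence) at every step
TRIALS = 100_000
HORIZON = 2_000     # steps after which we pessimistically declare "still running"

still_running = 0
lengths = []
for _ in range(TRIALS):
    t = 0
    while t < HORIZON:
        t += 1
        # Worst case for halting: EOS is emitted with exactly probability EPS.
        if random.random() < EPS:
            break
    else:
        still_running += 1
    lengths.append(t)

print(f"mean length       : {sum(lengths) / TRIALS:.1f}   (theory: 1/eps = {1 / EPS:.0f})")
print(f"P(run > {HORIZON} steps): {still_running / TRIALS:.5f}   "
      f"(theory: (1 - eps)^{HORIZON} = {(1 - EPS) ** HORIZON:.2e})")
```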
Some readers will have noticed that the above argument describes a slightly different GPT from the ones actually deployed, but the intuition does apply to transformers in general. While other architectures like RNNs can, at least in principle, perform an unbounded number of sequential computation steps during inference, transformers cannot, even in theory. In practice, transformers rely on strange crutches like thinking tokens (e.g. DeepSeek-R1 likes to output "But wait, …") to do what might be called approximating Turing completeness. Similar but more involved arguments prove that commonly used GPT architectures without chain-of-thought cannot learn a generalizable algorithm for even the simplest problems, such as telling an odd number of symbols ("yyy") apart from an even one ("yyyy") (Hahn 2020).
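To make the parity example concrete, here is a minimal sketch of the task Hahn's result concerns, and of how emitting intermediate thinking tokens sidesteps the bounded-depth problem (the scratchpad format below is something I made up for illustration, not DeepSeek-R1's actual output):

```python
def parity_direct(s: str) -> str:
    """Ground truth: 'even' or 'odd' number of 'y' symbols.

    Computing this exactly requires combining information from every
    position, which a fixed-depth transformer must squeeze into a bounded
    number of aggregation steps -- the crux of Hahn's (2020) limitation.
    """
    return "even" if s.count("y") % 2 == 0 else "odd"

def parity_with_scratchpad(s: str) -> list[str]:
    """Chain-of-thought style: emit one 'thinking' token per input symbol.

    Each emitted token depends only on the previous token and one input
    symbol, i.e. a bounded amount of new work per step -- exactly the kind
    of computation a fixed-depth model can do, if it is allowed to keep
    generating. (The token format here is invented for illustration.)
    """
    state = "even"
    trace = []
    for _ in s:
        state = "odd" if state == "even" else "even"
        trace.append(f"so far: {state}")
    trace.append(f"answer: {state}")
    return trace

print(parity_direct("yyy"))                 # odd
print(parity_with_scratchpad("yyy")[-1])    # answer: odd
```

The point is that the unbounded part of the work gets moved into the length of the generated sequence rather than the depth of the network.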
But please do not get any ideas about going back to recurrent networks. This limitation of transformers is precisely the reason why GPTs are so easy to train: the lesser expressiveness of the transformer is a feature, not a bug. The fact that the length of the computational path between the input tokens and the output probabilities is bounded is what allows training to be parallelized across the whole sequence (see also Merrill and Sabharwal 2023).
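Here is a minimal numpy sketch of that contrast (the shapes and the toy single-layer attention are illustrative, not any production implementation): with a causal mask, the outputs at all positions are computed in one batched pass, whereas a recurrent model has to thread its hidden state through the sequence one position at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                       # sequence length, hidden size
x = rng.normal(size=(T, d))       # toy input embeddings for one sequence

# Transformer-style: one causal self-attention pass covers all T positions at once,
# and the compute path from any input to any output has a fixed, bounded depth.
scores = x @ x.T / np.sqrt(d)                                       # (T, T) scores
scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -np.inf)   # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out_parallel = weights @ x        # every position's output in one matmul

# RNN-style: the hidden state is threaded through an unavoidable sequential loop,
# so the path from the first input to the last output grows with T.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
out_sequential = []
for t in range(T):
    h = np.tanh(x[t] + h @ W)     # step t depends on step t - 1
    out_sequential.append(h)

print(out_parallel.shape, np.stack(out_sequential).shape)   # (6, 4) (6, 4)
```

During training with teacher forcing, the one-pass property means the loss at every position of every training sequence can be computed simultaneously on a GPU; the nonlinear recurrent loop cannot be parallelized across time in the same way.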
I claim this awkwardness is central to the disagreement between those who believe GPTs are sufficient for automating most of current human labor and those who believe that only a fundamentally different architecture will get us there. Maybe the Chomsky hierarchy is just a bad way to think about computation, GPTs are fine, and we humans also only ever do fixed-depth approximations to Turing completeness. But maybe unbounded computation path lengths are needed to learn algorithms that generalize the way we can. It could be that this very parallelism is the peril that trumpets the end of the scaling paradigm.