Why I don't think AGI is imminent

原始链接: https://dlants.me/agi-not-imminent.html

## The State of AI and the Path to True Intelligence

Claims by OpenAI and Anthropic that human-level AI is near have sparked debate, but closer analysis shows that major obstacles remain. This piece argues that current large language models (LLMs), while impressive, lack the basic cognitive "primitives" (object permanence, causality, a sense of number) that evolution hardwired into vertebrate brains. LLMs excel at statistical pattern recognition over language, but because language *assumes* these underlying primitives rather than stating them explicitly, the models struggle with basic reasoning. Training on more data, even video, is not enough; today's "world models" tend to learn surface patterns rather than a genuine understanding of physical law. Progress will require moving beyond passively observed data toward *embodied* AI: agents that learn by interacting with rich, multisensory environments. Projects such as DeepMind's SIMA 2 and Dreamer 4 are exploring this, but they mostly rely on pretrained language models or narrowly defined tasks and do not bridge the gap between simulated action and genuine cognitive capability. Ultimately, reaching artificial general intelligence (AGI) may require entirely new architectures beyond today's transformers, possibly incorporating feedback loops and symbolic reasoning, along with a long-term, cross-disciplinary research effort. Despite the enormous capital being invested, a recent survey found that most AI researchers believe simply scaling current approaches will not reach AGI, underscoring the gulf between hype and reality.

## AGI: Is It Here? A Hacker News Discussion Summary

A Hacker News thread debated how close artificial general intelligence (AGI) really is. Many commenters argued that AGI is nearer than people think, and some claimed it *already exists* in the capabilities of advanced LLMs like Opus 4.6, particularly for white-collar knowledge work; on this view the missing piece is not intelligence itself but a "tested orchestration layer" for using these models effectively. Others remained skeptical, citing LLMs' lack of real-world understanding (they struggle with physical tasks) and their tendency to state inaccurate information with confidence. One point of contention was how to define AGI: some proposed "above-trend GDP growth" as the benchmark, while others equated it with the general intelligence of an average person (a surprisingly low bar). A recurring theme was that the goalposts for AGI keep moving, often driven by financial interests. In the end, many felt the debate was moot and argued for simply waiting to see how AI develops.

## Original Article

February 14, 2026

The CEOs of OpenAI and Anthropic have both claimed that human-level AI is just around the corner — and at times, that it's already here. These claims have generated enormous public attention. There has been some technical scrutiny of these claims, but critiques rarely reach the public discourse. This piece is a sketch of my own thinking about the boundary between transformer-based large language models and human-level cognition. I have an MS degree in Machine Learning from over a decade ago, and I don't work in the field of AI currently, but I am well-read on the underlying research. If you know more than I do about these topics, please reach out and let me know; I would love to develop my thinking on this further.

Research in evolutionary neuroscience has identified a set of cognitive primitives that are hardwired into vertebrate brains: some of these are a sense of number, object permanence, causality, spatial navigation, and the ability to distinguish animate from inanimate motion. These capacities are shared across vertebrates, from fish to ungulates to primates, pointing to a common evolutionary origin hundreds of millions of years old.

Language evolved on top of these primitives — a tool for communication where both speaker and listener share the same cognitive foundation. Because both sides have always had these primitives, language takes them for granted and does not state them explicitly.

Consider the sentence "Mary held a ball." To understand it, you need to know that Mary is an animate entity capable of intentional action, that the ball is a separate, bounded, inanimate object with continuous existence through time, that Mary is roughly human-sized and upright while the ball is small enough to fit in her hand, that her hand exerts an upward force counteracting gravity, that the ball cannot pass through her palm, that releasing her grip would cause the ball to fall, and that there is one Mary and one ball, each persisting as the same entity from moment to moment, each occupying a distinct region of three-dimensional space. All of that is what a human understands from four words, and none of it is in the text. Modern LLMs are now trying to reverse-engineer this cognitive foundation from language, which is an extremely difficult task.

I find this to be a useful framing for understanding many of the observed limitations of current LLM architectures. For example, transformer-based language models can't reliably do multi-digit arithmetic because they have no number sense, only statistical patterns over digit tokens. They can't generalize simple logical relationships — a model trained on "A is B" can't infer "B is A" — because they lack the compositional, symbolic machinery.
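
To make the "statistical patterns over digit tokens" point concrete, here is a small sketch of how a modern BPE tokenizer carves numbers into irregular chunks. This is my own illustration, not from the original post; it assumes the tiktoken library is installed, and the exact splits vary by tokenizer.

```python
# Illustration (not from the original post): a BPE tokenizer splits numbers
# into irregular chunks, so a model never sees digits aligned by place value.
# Requires `pip install tiktoken`; exact splits vary by tokenizer and model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["7 + 5 =", "317547 + 908122 ="]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r} -> {pieces}")

# The long addition is carved into multi-digit chunks, so "carrying" has to be
# learned as a statistical pattern over chunk sequences rather than executed
# as a positional algorithm over individual digits.
```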

One might object: modern AIs are now being trained on video, not just text. And it's true that video prediction can teach something like object permanence. If you want to predict the next frame, you need to model what happens when an object passes behind an occluder, which is something like a representation of persistence. But I think the reality is more nuanced. Consider a shell game: a marble is placed under one of three cups, and the cups are shuffled. A video prediction model might learn the statistical regularity that "when a cup is lifted, a marble is usually there." But actually tracking the marble through the shuffling requires something deeper — a commitment to the marble as a persistent entity with a continuous trajectory through space. That's not merely a visual pattern.
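
A toy simulation makes the distinction concrete (this is my own illustration, not the author's): a predictor that commits to the marble as one persistent entity and follows every swap is always right, while a predictor that only knows outcome frequencies is stuck at chance.

```python
# Toy shell game (illustrative only): entity tracking vs. outcome statistics.
import random
random.seed(0)

def play_one(n_swaps=10):
    start = random.randrange(3)                       # marble placed in full view
    swaps = [tuple(random.sample(range(3), 2)) for _ in range(n_swaps)]

    def follow(cup):                                  # apply the visible swaps to a belief
        for a, b in swaps:
            cup = b if cup == a else a if cup == b else cup
        return cup

    truth = follow(start)                             # where the marble really ends up
    tracker_guess = follow(start)                     # tracks one persistent marble: always right
    frequency_guess = random.randrange(3)             # "a marble is usually there": chance level
    return truth, tracker_guess, frequency_guess

games = [play_one() for _ in range(10_000)]
print("tracker accuracy:  ", sum(t == g for t, g, _ in games) / len(games))   # -> 1.00
print("frequency baseline:", sum(t == g for t, _, g in games) / len(games))   # -> ~0.33
```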

The shortcomings of visual models align with this framing. Early GPT-based vision models failed at even basic spatial reasoning. Much of the recent progress has come from generating large swaths of synthetic training data. But even in this, we are trying to learn the physical and logical constraints of the real world from visual data. The results, predictably, are fragile. A model trained on synthetic shell game data could probably learn to track the marble. But I suspect that learning would not generalize to other situations and relations — it would be shell game tracking, not object permanence.

Developmental psychologist Elizabeth Spelke's research on "core knowledge" has shown that infants — including blind infants — represent objects as bounded, cohesive, spatiotemporally continuous entities. This isn't a learned visual skill. It appears to be something deeper: a fundamental category of representation that the brain uses to organize all sensory input. Objects have identity. They persist. They can't teleport or merge. This "object-ness" likely predates vision itself — it's rooted in hundreds of millions of years of organisms needing to interact with things in the physical world, and I think this aspect of our evolutionary "training environment" is key to our robust cognitive primitives. Organisms don't merely observe reality to predict what happens next. They perceive in order to act, and they act in order to perceive. Object permanence allows you to track prey behind an obstacle. Number sense lets you estimate whether you're outnumbered. Logical composition enables tool construction and use. Spatial navigation helps you find your way home. Every cognitive primitive is directly linked to action in a rich, multisensory, physical world.

As Rodney Brooks has pointed out, even human dexterity is a tight coupling of fine motor control and rich sensory feedback. Modern robots do not have access to nearly such rich sensory information. While LLMs have benefited from vast quantities of text, video, and audio available on the internet, we simply don't have large-scale datasets of rich, multisensory perception coupled to intentional action. Collecting or generating such data is extremely challenging.

What if we built simulated environments where AIs could gather embodied experience? Would we be able to create learning scenarios where agents could learn some of these cognitive primitives, and could that generalize to improve LLMs? I found a few papers that poke in this direction.

Google DeepMind's SIMA 2 is one. Despite the "embodied agent" branding, SIMA 2 is primarily trained through behavioral cloning: it watches human gameplay videos and learns to predict what actions they took. The reasoning and planning come from its base model (Gemini Flash-Lite), which was pretrained on internet text and images — not from embodied experience. There is an RL self-improvement stage where the agent does interact with environments, but this is secondary; the core intelligence is borrowed from language pretraining. SIMA 2 reaches near-human performance on many game tasks, but what it's really demonstrating is that a powerful language model can be taught to output keyboard actions.

Can insights from world-model training actually transfer to and improve language understanding? DeepMind's researchers explicitly frame this as a trade-off between two competing objectives: "embodied competence" (acting effectively in 3D worlds) and "general reasoning" (the language and math abilities from pretraining). They found that baseline Gemini models, despite being powerful language models, achieved only 3-7% success rates on embodied tasks — demonstrating that embodied competence is not something that emerges from language pretraining. After fine-tuning on gameplay data, SIMA 2 achieved near-human performance on embodied tasks while showing "only minor regression" on language and math benchmarks. But notice the framing: the best case is that embodied training doesn't hurt language ability too much. There's no evidence that it improves it. The two capabilities sit in separate regions of the model's parameter space, coexisting but not meaningfully interacting. LLMs have billions of parameters, and there is plenty of room in those weights to predict language and to model a physical world separately. Bridging that gap — using physical understanding to actually improve language reasoning — remains undemonstrated.

DeepMind's Dreamer 4 also hints at this direction. Rather than borrowing intelligence from a language model, Dreamer 4 learns a world model from gameplay footage, then trains an RL agent within that world model through simulated rollouts where the agent takes actions, observes consequences provided by the world model, and updates its policy. This is genuinely closer to perception-action coupling: the agent learns through acting. However, the goal of this research is not general intelligence — it's sample-efficient control for robotics. The agent is trained and evaluated on predefined task milestones (get wood, craft pickaxe, find diamond), scored by a learned reward model. Nobody has tested whether the representations learned through this sort of training generalize to reasoning, language, or anything beyond the specific control tasks they were trained on. The gap between "an agent that learns to get diamonds in Minecraft through simulated practice" and "embodied experience that produces transferable cognitive primitives" is enormous and entirely unexplored.
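
As a loose, runnable caricature of that loop (my own toy example, nothing like the actual Dreamer 4 system), the sketch below fits a model of a tiny one-dimensional corridor from collected transitions and then improves a policy using only rollouts imagined inside that model:

```python
# Toy "training in imagination" (illustrative only; hypothetical setup).
import random
random.seed(0)

N_STATES, GOAL, ACTIONS = 7, 6, (-1, +1)

def real_step(s, a):
    """The real environment: a 1-D corridor with the goal at one end."""
    return max(0, min(N_STATES - 1, s + a))

# 1. Fit a "world model" from passively collected transitions
#    (a lookup table stands in for a learned neural predictor).
model = {}
for _ in range(2000):
    s, a = random.randrange(N_STATES), random.choice(ACTIONS)
    model[(s, a)] = real_step(s, a)

# 2. Improve a policy using only rollouts imagined inside the model: the agent
#    picks actions, the model supplies the consequences, values get updated.
V = [0.0] * N_STATES

def imagined_return(s, a, depth=10):
    total = 0.0
    for _ in range(depth):
        s = model[(s, a)]                                     # model, not environment
        total += 1.0 if s == GOAL else 0.0
        a = max(ACTIONS, key=lambda a2: V[model[(s, a2)]])    # greedy in imagination
    return total

for _ in range(50):                                           # improvement happens in the "dream"
    V = [max(imagined_return(s, a) for a in ACTIONS) for s in range(N_STATES)]

policy = {s: max(ACTIONS, key=lambda a: imagined_return(s, a)) for s in range(N_STATES)}
print(policy)   # every state left of the goal should choose +1 (toward state 6)
```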

As far as I understand, we don't know how to:

  • embed an agent in a perception-action coupled training environment

  • create an objective and training process that leads it to learn cognitive primitives like spatial reasoning or object permanence

  • leverage this to improve language models or move closer to general artificial intelligence

Recent benchmarking work underscores how far we are. Stanford's ENACT benchmark (2025) tested whether frontier vision-language models exhibit signs of embodied cognition — things like affordance recognition, action-effect reasoning, and long-horizon memory. The results were stark: current models lag significantly behind humans, and the gap widens as tasks require longer interaction horizons.

In short: world models are a genuinely exciting direction, and they could be the path to learning foundational primitives like object permanence, causality, and affordance. But this work is still in the absolute earliest stages. Transformers were an incredible leap forward, which is why we now have things like the ENACT benchmark that better illustrate the boundaries of cognition. I think this area is really promising, but research in this space could easily take decades.

I will also mention that the most prominent "world model" comes from Yann LeCun, who recently left Meta to start AMI Labs. His Joint Embedding Predictive Architecture (JEPA) is a representation learning method: it trains a Vision Transformer on video data, masking parts of the input and predicting their abstract representations rather than their raw pixels. The innovation is predicting in representation space rather than input space, which lets the model focus on high-level structure and ignore unpredictable low-level details. This is a genuine improvement over generative approaches for learning useful embeddings. But despite the "world model" branding, JEPA's actual implementations (I-JEPA, V-JEPA, V-JEPA 2) are still training on passively observed video — not on agents embedded in physics simulations. There is no perception-action coupling, no closed-loop interaction with an environment. JEPA is a more sophisticated way to learn from observation, but by the logic of the argument above, observation alone is unlikely to yield the cognitive primitives that emerge from acting in the world.
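
Here is a minimal sketch of that core idea, using toy linear encoders of my own rather than the actual I-JEPA/V-JEPA code: the loss compares predicted and target *representations* of the masked patches, never their pixels.

```python
# Minimal JEPA-style sketch (illustrative, not the published implementation).
# Requires PyTorch.
import torch
import torch.nn as nn

dim_patch, dim_repr, n_patches = 256, 128, 16

context_encoder = nn.Linear(dim_patch, dim_repr)   # sees only the visible patches
target_encoder = nn.Linear(dim_patch, dim_repr)    # sees everything; no gradient updates
predictor = nn.Linear(dim_repr, dim_repr)

patches = torch.randn(n_patches, dim_patch)        # stand-in for image/video patches
mask = torch.zeros(n_patches, dtype=torch.bool)
mask[4:8] = True                                   # these patches are hidden from the context encoder

with torch.no_grad():
    targets = target_encoder(patches[mask])        # abstract targets, not raw pixels

context = context_encoder(patches[~mask]).mean(dim=0)   # crude pooling of the visible context
preds = predictor(context).expand_as(targets)           # predict each masked patch's representation

loss = nn.functional.mse_loss(preds, targets)      # the loss lives in representation space
loss.backward()                                    # only context encoder + predictor get gradients

# In the real method the target encoder is an exponential moving average of the
# context encoder, which is what prevents the representations from collapsing.
```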

The ARC-AGI benchmark offers an important illustration of where these primitives show up. ARC tasks are grid-based visual puzzles that test abstract reasoning: spatial composition, symmetry, relational abstraction, and few-shot generalization. They require no world knowledge or language — just the ability to infer abstract rules from a handful of examples and apply them to novel cases. Humans solve these tasks trivially, usually in under two attempts. When ARC-AGI-2 launched in March 2025, pure LLMs scored 0% and frontier reasoning systems achieved only single-digit percentages. By the end of the year, refinement-loop systems — scaffolding that wraps a model in iterative generate-verify-refine cycles — pushed scores to 54% on the semi-private eval and as high as 75% on the public eval using GPT-5.2, surpassing the 60% human average. But the nature of this progress matters as much as the numbers.

The top standalone model without refinement scaffolding — Claude Opus 4.5 — scores 37.6%. It takes a refinement harness running dozens of iterative generate-verify-refine cycles at $30/task to push that to 54%, and a combination of GPT-5.2's strongest reasoning mode plus such a harness to reach 75%. This is not behavior that comes out of the core transformer architecture — it is scaffolded brute-force search, with each percentage point requiring substantially more compute. The ARC Prize Grand Prize at 85% remains unclaimed.
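
To make "scaffolded brute-force search" concrete, here is a toy sketch of a generate-verify-refine loop. It is my own illustration: the real harnesses are far more elaborate, and the "generator" here is just a random proposer over three hand-written rules rather than an LLM.

```python
# Toy generate-verify-refine harness (illustrative only).
import random
random.seed(1)

# Toy ARC-style task: the hidden rule is "transpose the grid".
train_pairs = [([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
               ([[0, 5], [6, 0]], [[0, 6], [5, 0]])]
test_input = [[7, 8], [9, 1]]

# Stand-in generator: proposes candidate transformations. A real harness
# samples candidate programs from an LLM, conditioned on the failures so far.
CANDIDATES = {
    "identity":     lambda g: g,
    "reverse_rows": lambda g: [row[::-1] for row in g],
    "transpose":    lambda g: [list(col) for col in zip(*g)],
}

tried = []
for _ in range(10):                                    # the refinement loop
    untried = [n for n in CANDIDATES if n not in tried]
    name = random.choice(untried or list(CANDIDATES))  # 1. generate a candidate rule
    rule = CANDIDATES[name]
    failures = [(i, o) for i, o in train_pairs if rule(i) != o]
    if not failures:                                   # 2. verify against the demonstrations
        print("accepted rule:", name, "-> answer:", rule(test_input))
        break
    tried.append(name)                                 # 3. refine: avoid what just failed
```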

ARC is important because it illustrates the kind of abstract reasoning that seems central to intelligence. For humans, these capabilities arose from embodied experience. It's conceivable that training methods operating in purely abstract or logical spaces could teach an agent similar primitives without embodiment. We simply don't know yet. Research in this direction is just beginning, catalyzed by benchmarks like ARC that are sharpening our understanding of the boundary between what LLMs do and what intelligence actually requires. Notably, the benchmark itself is evolving in this direction: ARC-AGI-3 introduces interactive reasoning challenges requiring exploration, planning, memory, and goal acquisition — moving closer to the perception-action coupling that I argue is central to intelligence.

It's worth addressing a common counterargument here: AI models have saturated many benchmarks in recent years, and we have to keep introducing new ones. Isn't this just moving the goalposts? I don't think this framing is true - benchmark saturation is exactly how we learn what a benchmark was actually measuring. Creating different benchmarks in response is not goalpost-moving — it's the normal process of refining our instruments and understanding. The "G" in AGI stands for "general" — truly general intelligence should transfer from one reasoning task to another. If a model had genuinely learned abstract reasoning from saturating one benchmark, the next benchmark testing similar capabilities should be easy, not devastating. The fact that each new generation of benchmarks consistently exposes fundamental failures is itself evidence about the nature of the gap. The ARC benchmark series illustrates this well: the progression from ARC-AGI-1 to ARC-AGI-3 didn't require heroic effort to find tasks that stump AI while remaining easy for humans - it just required refining the understanding of where the boundary lies. Tasks that are trivially easy for humans but impossible for current models are abundant (see multi-digit arithmetic, above). The benchmark designers aren't hunting for exotic edge cases; they're mapping a vast territory of basic cognitive capability that AI simply doesn't have.

The transformer architectures powering current LLMs are strictly feed-forward. Information flows from tokens through successive layers to the output, and from earlier tokens to later ones, but never backward. This is partly because backpropagation — the method used to train neural networks — requires acyclic computation graphs. But there's also a hard practical constraint: these models have hundreds of billions of parameters and are trained on trillions of tokens, and rely heavily on reusing computation. When processing token N+1, an LLM reuses all the computation from tokens 1 through N (a technique called KV caching). This is what makes training and inference tractable at scale. But it also means the architecture is locked into a one-directional flow — processing a new token can never revisit or revise the representations of earlier ones. Any architecture that allowed backward flow would compromise this caching, requiring novel computational techniques to make it tractable at scale.
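
As a concrete illustration of this constraint, here is a minimal numpy sketch of single-head causal attention with a KV cache (my own simplification: one head, no batching, no layers). Each new token appends its key and value and attends over everything cached so far, but nothing that was already computed is ever revisited.

```python
# Single-head causal attention with a KV cache (illustrative sketch).
import numpy as np

d = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []

def step(x):
    """Process one new token embedding x, reusing the cache but never rewriting it."""
    q = x @ W_q
    K_cache.append(x @ W_k)              # appended once, frozen forever
    V_cache.append(x @ W_v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)          # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # contextualized representation of x

tokens = rng.standard_normal((5, d))
outputs = [step(x) for x in tokens]      # strictly left-to-right: computing token 4
                                         # never updates what was stored for tokens 1-3
```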

Human brains function in a fundamentally different way. The brain is not a feed-forward pipeline. Activations reverberate through recurrent, bidirectional connections, eventually settling into stable patterns. For every feedforward connection in the visual cortex, there is a reciprocal feedback connection carrying contextual information back to earlier processing stages. When you recognize a face, it's not the output of a single forward pass — it's the result of distributed activity that echoes back and forth between regions until the system converges on an interpretation.

This is not to say that the human brain architecture is necessary to reach general intelligence. But the contrast helps contextualize just how constrained current LLM architectures are. There's a growing body of peer-reviewed theoretical work formalizing these constraints. Merrill and Sabharwal have shown that fixed-depth transformers with realistic (log-precision) arithmetic fall within the complexity class TC⁰ — which means they provably cannot recognize even certain regular languages or determine whether two nodes in a graph are connected. These are formally simple problems, well within the reach of basic algorithms, that transformers provably cannot solve in a single forward pass. This isn't an engineering limitation to be overcome with more data or compute — it's a mathematical property of the architecture itself. And Merrill and Sabharwal go further, arguing that this is a consequence of the transformer's high parallelizability: any architecture that is as parallelizable — and therefore as scalable — will hit similar walls.
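
To make "formally simple problems" concrete: graph connectivity is settled by a few lines of breadth-first search, but the number of sequential propagation steps the search needs grows with the graph, which is exactly what a fixed-depth forward pass cannot supply (per the Merrill and Sabharwal results above).

```python
# Graph connectivity by breadth-first search: trivial as an algorithm, but it
# requires a number of sequential steps that grows with the graph.
from collections import deque

def connected(n_nodes, edges, source, target):
    adjacency = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, queue = {source}, deque([source])
    while queue:                          # each layer of the search is one more
        node = queue.popleft()            # sequential step of information flow
        if node == target:
            return True
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(connected(6, [(0, 1), (1, 2), (3, 4)], 0, 2))   # True
print(connected(6, [(0, 1), (1, 2), (3, 4)], 0, 5))   # False
```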

What might alternative architectures look like? Gary Marcus has long advocated for other approaches, like neurosymbolic AI — hybrid systems that combine neural networks with explicit symbolic reasoning modules for logic, compositionality, and variable binding. I think that neural architectures with feedback connections — networks that are not strictly feed-forward but allow information to flow backward and settle into stable states — could learn to represent cognitive primitives. The challenge, as discussed above, is that such architectures break the computational shortcuts that make current transformers trainable and deployable at scale. In either case, getting neurosymbolic, recurrent or bidirectional neural networks to work at the scale of modern LLMs is an open engineering and research problem.
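
As a toy illustration of the "settle into stable states" idea (my own sketch, not any specific proposed architecture), the snippet below iterates a recurrent update until the hidden state stops changing, so that later computation feeds back into earlier representations:

```python
# A recurrent update iterated to a fixed point (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
d = 32
W_in = rng.standard_normal((d, d)) * 0.1
W_rec = rng.standard_normal((d, d)) * 0.05    # feedback: the state talks to itself

def settle(x, max_steps=100, tol=1e-6):
    h = np.zeros(d)
    for step in range(max_steps):
        h_next = np.tanh(x @ W_in + h @ W_rec)   # information flows back through h
        if np.linalg.norm(h_next - h) < tol:
            return h_next, step                  # converged on a stable interpretation
        h = h_next
    return h, max_steps

x = rng.standard_normal(d)
h_star, steps = settle(x)
print(f"settled after {steps} iterations")

# Training such fixed-point networks (e.g. with implicit differentiation, as in
# deep equilibrium models) is exactly the kind of thing that breaks the caching
# tricks that make transformers cheap to train and serve at scale.
```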

Most people encounter AGI through CEO proclamations. Sam Altman claims that OpenAI knows how to build superintelligent AI. Dario Amodei writes that AI could be "smarter than a Nobel Prize winner across most relevant fields" by 2026. These are marketing statements from people whose companies depend on continued investment in the premise that AGI is imminent. They are not technical arguments.

Meanwhile, the actual research community tells a different story. A 2025 survey of 475 AI researchers by the Association for the Advancement of Artificial Intelligence (AAAI) found that 76% believe scaling up current AI approaches to achieve AGI is "unlikely" or "very unlikely" to succeed. The researchers cited specific limitations: difficulties in long-term planning and reasoning, generalization beyond training data, causal and counterfactual reasoning, and embodiment and real-world interaction. This is an extraordinary disconnect.

Consider the AI 2027 scenario, perhaps the most widely-discussed AGI forecast of 2025. The underlying model's first step is automating coding, which is entirely based on an extrapolation of the METR study on coding time horizons. The METR study collects coding tasks that an AI can complete with a 50% success rate, and tracks how the duration of those tasks grows over time. But task duration is not a measure of task complexity. As the ARC-AGI benchmarks illustrate, there are classes of problems that take humans only seconds to solve but that require AI systems thousands of dollars of compute and dozens of iterative refinement cycles to approach — and even then, the 85% Grand Prize threshold remains unmet. The focus on common coding tasks strongly emphasizes within-distribution tasks, which are well represented in the AI training set. The 50% success threshold also allows one to ignore precisely the tricky, out-of-distribution short tasks that agents may not be making any progress on at all. The second step within the 2027 modeling is agents developing "research taste". My take is that research taste is going to rely heavily on the short-duration cognitive primitives that ARC highlights but the METR metric does not capture.

I'd encourage anyone interested in this topic to seek out technical depth. Understand what these systems actually can and can't do. The real story is fascinating - it's about the fundamental nature of intelligence, and how far we still have to go to understand it.

Betting against AI is difficult currently, due to the sheer amount of capital being thrown at it. One thing I've spent a lot of time thinking about is — what if there's a lab somewhere out there that's about to crack this? Maybe there are labs — even within OpenAI and Anthropic themselves — that are already working on all of these problems and keeping the work secret?

But the open questions described above are not the kind of problem a secret lab can solve. They are long-standing problems that span multiple different fields — embodied cognition, evolutionary neuroscience, architecture design and complexity theory, training methodology and generalizability. Solving problems like this requires a global research community working across disciplines over many years, with plenty of dead ends along the way. This is high-risk, low-probability-of-reward, researchers-tinkering-in-a-lab kind of work. It's not a sprint towards a finish line.

This also helps us frame what AI companies are actually doing. They're buying up GPUs, building data centers, expanding product surface area, securing more funding. They are scaling up the current paradigm, which has little bearing on the fundamental research needed to make progress on the problems highlighted above.

I'm not saying that AGI is impossible, or even that it won't come within our lifetime. I fully believe neural networks, using appropriate architectures and training methods, can represent cognitive primitives and reach superhuman intelligence. They can probably do this without repeating our long evolutionary history, by training in simulated logical / symbolic environments that have little to do with the physical world. I am also not saying that LLMs aren't useful. Even the current technology is fundamentally transforming our society (see AI is not mid - a response to Dr. Cottom’s NYT Op-Ed).

We have to remember though that neural networks have their origins in the 1950s. Modern backpropagation was popularized in 1986. Many of the advances that made modern GPTs possible were discovered gradually over the following decades:

  • Long Short-Term Memory (LSTM) networks, which solved the vanishing gradient problem for sequence modeling — Hochreiter and Schmidhuber, 1997

  • Attention mechanisms, which allowed models to dynamically focus on relevant parts of their input — Bahdanau et al., 2014

  • Residual connections (skip layers), which made it possible to train networks hundreds of layers deep — He et al., 2015

  • The transformer architecture itself, which combined attention with parallelizable training to replace recurrent networks entirely — Vaswani et al., 2017

Transformers have fundamental limitations. They are very powerful, and they have taught us a lot about what general intelligence is. We are gaining a more and more crisp understanding of where the boundaries lie. But solving these problems will require research, which is a non-linear process full of dead ends and plateaus. It could take decades, and even then we might discover new and more nuanced issues.
