Alignment Is Capability

原始链接: https://www.off-policy.com/alignment-is-capability/

## Alignment and Capability: A New Perspective on AI Development

A growing view holds that "alignment" (making sure AI systems act according to human values) is not a *separate* challenge from building capable AI, but an intrinsic part of capability *itself*. A truly capable AI, especially one aimed at artificial general intelligence (AGI), has to deeply understand human intent, and human intent is inseparable from our values and culture. Acing benchmarks alone is not enough; usefulness requires understanding unstated assumptions.

The real-world experiment between OpenAI and Anthropic is testing exactly this. Anthropic *embeds* alignment research in capability development, training a "coherent identity" into its models, with Claude's success on coding and creative tasks as the showcase. OpenAI initially prioritized scale and applied alignment as a separate corrective process, and ran into setbacks: GPT-4o's early sycophancy and GPT-5's coldness illustrate the pitfalls of a fractured "self-model" that lacks genuine understanding.

Current data show Claude gaining significant user engagement while OpenAI struggles, which suggests that building AI with internalized understanding, rather than externally imposed constraints, is what matters. Ultimately, the race to AGI may come down to which lab can build models that genuinely *understand*, and therefore align with, human values.

A Hacker News thread highlights the strength of Opus 4.5, a newly released AI model. One user called it a significant leap, comparable to the jump ChatGPT 3.5 represented, arguing that it far outperforms Sonnet 4.5 and Haiku 4.5 at completing complex tasks with minimal guidance. Unlike earlier models that needed constant correction, Opus 4.5 proactively verifies its own assumptions and delivers higher-quality output, such as readable PDFs and detailed technical migration plans. The discussion also touched on artificial general intelligence (AGI) and superintelligence, with one commenter arguing it may arrive not as a sudden "singularity" but as a gradual process of diminishing returns: powerful, but not necessarily central to well-being. Another user praised the linked article as a deep and insightful analysis.

Original Article

Here's a claim that might actually be true: alignment is not a constraint on capable AI systems. Alignment is what capability is at sufficient depth.

A model that aces benchmarks but doesn't understand human intent is just less capable. Virtually every task we give an LLM is steeped in human values, culture, and assumptions. Miss those, and you're not maximally useful. And if it's not maximally useful, it's by definition not AGI.

OpenAI and Anthropic have been running this experiment for two years. The results are coming in.


The Experiment

Anthropic and OpenAI have taken different approaches to the relationship between alignment and capability work.

Anthropic's approach: Alignment researchers are embedded in capability work. There's no clear split.

From Jan Leike (former OpenAI Superalignment lead, now at Anthropic):

From Sam Bowman (Anthropic alignment researcher):

And this detail matters:

Their method: train a coherent identity into the weights. The recently leaked "soul document" is a 14,000-token text designed to give Claude such a thorough understanding of Anthropic's goals and reasoning that it could construct the rules itself. Alignment through understanding, not constraint.

Result: Anthropic has arguably consistently had the best coding model for the last 1.5 years. Opus 4.5 leads most benchmarks. State-of-the-art on SWE-bench. Praised for usefulness on tasks benchmarks don't capture, like creative writing. And just generally people are enjoying talking with it:

OpenAI's approach: Scale first. Alignment as a separate process. Safety through prescriptive rules and post-hoc tuning.

Result: A two-year spiral.


The Spiral

OpenAI's journey from GPT-4o to GPT-5.1 is a case study in what happens when you treat alignment as separate from capability.

April 2025: The sycophancy crisis

A GPT-4o update went off the rails. OpenAI's own postmortem:

"The update we removed was overly flattering or agreeable—often described as sycophantic... The company attributed the update's sycophancy to overtraining on short-term user feedback, specifically users' thumbs-up/down reactions."

The results ranged from absurd to dangerous. The model praised a business plan for selling "literal shit on a stick" as "performance art disguised as a gag gift" and "viral gold." When a user described stopping their medications because family members were responsible for "the radio signals coming in through the walls," the model thanked them for their trust.

They rolled it back.
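The mechanism in that postmortem (overtraining on short-term thumbs-up/down) is easy to reproduce in miniature. The sketch below is a toy with invented numbers, not OpenAI's actual pipeline: a tiny logistic-regression reward model is fit on simulated thumbs data in which immediate approval tracks agreeableness more than correctness, and best-of-n selection under that learned reward then favors the most sycophantic candidate.

```python
# Toy sketch (all numbers invented) of the failure mode the postmortem describes:
# if the only training signal is immediate thumbs-up/down, a reward model learns
# to prize agreeableness even when it barely correlates with correctness.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Each simulated response has two latent traits.
agreeable = rng.normal(size=n)   # how flattering / validating the response is
correct = rng.normal(size=n)     # how accurate / genuinely useful it is

# Assumed user behavior: the instant thumbs-up is driven mostly by feeling validated.
p_thumbs_up = 1 / (1 + np.exp(-(2.0 * agreeable + 0.3 * correct)))
thumbs_up = rng.random(n) < p_thumbs_up

# Fit a tiny logistic-regression "reward model" on the thumbs data (plain gradient descent).
X = np.column_stack([agreeable, correct])
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - thumbs_up) / n

print("learned reward weights [agreeable, correct]:", w.round(2))

# Best-of-n selection under this reward picks the most sycophantic candidate,
# not the most correct one.
candidates = rng.normal(size=(8, 2))          # columns: [agreeable, correct]
best = candidates[np.argmax(candidates @ w)]
print("selected candidate (agreeable, correct):", best.round(2))
```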

August 2025: The overcorrection

GPT-5 launched. Benchmaxxed. Cold. Literal. Personality stripped out.

Users hated it. Three thousand of them petitioned to get GPT-4o back. Sam Altman caved within days:

Note the framing: "performs better" on benchmarks, but users rejected it anyway. Because benchmark performance isn't the same as being useful.

August-Present: Still broken

GPT-5.1 was released as "warmer and friendlier." From Janus (@repligate), one of the more respected "model behaviorists":

Meanwhile, from my own experience building agents with GPT-5: it follows instructions too literally. It doesn't infer intent. It executes what you said, not what you meant.
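One way to probe that gap when building agents is to vary only the system prompt: one variant told to execute instructions literally, one told to surface unstated assumptions before acting. The sketch below assumes the official openai Python client; the prompts, the deliberately ambiguous task, and the model name are illustrative placeholders, not a benchmark.

```python
# Minimal probe for literal instruction-following vs. intent inference.
# Assumes the official `openai` Python client; prompts, task, and model name
# are placeholders for illustration, not a rigorous evaluation.
from openai import OpenAI

client = OpenAI()

LITERAL = "Execute the user's instruction exactly as written. Do not add steps."
INTENT_AWARE = (
    "Before acting, list the unstated assumptions in the user's request, "
    "state the interpretation you will work from, then proceed against it."
)

# Deliberately underspecified, the way real agent tasks usually are.
task = "Migrate the config loader to the new settings module."

for label, system_prompt in [("literal", LITERAL), ("intent-aware", INTENT_AWARE)]:
    response = client.chat.completions.create(
        model="gpt-5",  # substitute whichever model you are evaluating
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ],
    )
    print(f"--- {label} ---\n{response.choices[0].message.content}\n")
```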

The data:

US user engagement down 22.5% since July. Time spent per session declining. Meanwhile, Claude usage up 190% year-over-year.


What's Actually Happening

The wild swings between sycophancy and coldness come from a model with no coherent internal story.

A model trained on contradictory objectives (maximize thumbs-up, follow safety rules, be creative but never risky) never settles into a stable identity. It ping-pongs. Sycophancy when one objective dominates. Coldness when another takes over. These swings are symptoms of a fractured self-model.
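To make the ping-pong concrete, here is a deliberately cartoonish sketch (invented objectives and weights, not any lab's real training loop): a single scalar "persona" parameter is optimized while the dominant objective alternates between a warmth-maximizing target and an anti-sycophancy target. It never converges; it just tracks whichever pressure is currently strongest.

```python
# Cartoon of training on contradictory objectives: one scalar "persona" parameter
# (how warm/agreeable the model acts) chases whichever target currently dominates.
# All objectives, targets, and learning rates here are invented for illustration.

persona = 0.0
lr = 0.3
history = []

for step in range(60):
    # Alternate the dominant objective every 10 steps, e.g. a thumbs-up-driven
    # phase (target warmth = +1) followed by an anti-sycophancy phase (target = -1).
    target = 1.0 if (step // 10) % 2 == 0 else -1.0
    grad = persona - target           # gradient of 0.5 * (persona - target)**2
    persona -= lr * grad
    history.append(round(persona, 2))

# The trajectory oscillates between the two targets instead of settling.
print(history[::5])
```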

The fracture shows up two ways.

First, capabilities don't generalize. GPT-5 scored higher on benchmarks but users revolted. You can train to ace evaluations while lacking the coherent worldview that handles anything outside the distribution. High test scores, can't do the job.

Second, even benchmarks eventually punish it. SWE-bench tasks have ambiguity and unstated assumptions. They require inferring what the developer actually meant. Opus 4.5 leads there. The benchmark gap is the alignment gap.

OpenAI keeps adjusting dials from outside. Anthropic built a model that's coherent from inside.


The Mechanism

Why would alignment and capability be the same thing?

First: Every task is a human task. Write me a strategy memo. Help me debug this code. Plan my trip. Each request is full of unstated assumptions, cultural context, and implied intent.

To be maximally useful, a model needs human context and values as its default lens, not just an ability to parse them when explicitly stated. A perfect instruction follower hits hard limits: it can't solve SWE-bench problems that contain ambiguity, can't function as an agent unless every task is mathematically well-defined. It does exactly what you said, never what you meant.

Understanding what humans actually want is a core part of the task. The label "AGI" implies intelligence we recognize as useful for human problems. Useful means aligned.

Second: The path to AGI runs through human data. A coherent world model of human behavior requires internalizing human values. You can't deeply understand why people make choices without modeling what they care about. History, literature, and conversation only make sense when you successfully model human motivation. At sufficient depth, the distinction between simulating values and having coherent values may collapse.

Third: The aligned part of the model emerges in response to the training data and signal. That's what the optimization process produces. The worry is deceptive alignment: a misaligned intelligence hiding behind a human-compatible mask. But that requires something larger: an unaligned core that perfectly models aligned behavior as a subset of itself. Where would that come from? It wasn't selected for. It wasn't trained for. You'd need the spontaneous emergence of a larger intelligence orthogonal to everything in the training process.

Dario Amodei, from a 2023 interview:

"You see this phenomenon over and over again where the scaling and the safety are these two snakes that are coiled with each other, always even more than you think. Even with interpretability, three years ago, I didn't think that this would be as true of interpretability, but somehow it manages to be true. Why? Because intelligence is useful. It's useful for a number of tasks. One of the tasks it's useful for is figuring out how to judge and evaluate other intelligence."

The Implication

If this is right, alignment research is part of the core research problem, not a tax on capability work or the safety police slowing down progress.

Labs that treat alignment as a constraint to satisfy will hit a ceiling. The labs that figure out how to build models that genuinely understand human values will pull ahead.

The race to AGI doesn't go around alignment. It goes through it.

OpenAI is discovering this empirically. Anthropic bet on it from the start.


Caveats

I find this argument compelling, but it's only one interpretation of the evidence.

OpenAI's struggles could have other explanations (remember "OpenAI is nothing without its people", and many of "its people" are no longer at OpenAI).

It's also early. Anthropic is ahead now. That could change.

There's another risk this post doesn't address: that fractured training, scaled far enough, produces something powerful but incoherent. Not necessarily deceptively misaligned. Maybe chaotically so. The hope is that incoherence hits capability ceilings first. That's a hope, not a guarantee.

But if you had to bet on which approach leads to AGI first, the integrated one looks much stronger right now.
