LLMs are not the black box you were promised

Original link: https://www.jay.ai/blog/llms-are-not-a-black-box

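Anthropic's 2025 research report On the Biology of a Large Language Model marks a breakthrough in mechanistic interpretability, making large language models less of an opaque "black box." By working around superposition (the way a single concept is spread across many neurons), the researchers used circuit tracing to isolate sparse, human-interpretable features. The approach shows that large language models are not only predicting the next word but also carrying out genuine multi-step reasoning. By mapping how a feature such as "Texas" triggers a feature such as "Austin," researchers can see the model's internal "wiring diagram" directly. Interestingly, these models often develop "subconscious" algorithms, such as dedicated pathways for addition, that differ sharply from the human-like explanations the model gives when asked. This echoes findings from systems such as AlphaZero, which also converge independently on concepts that humans recognize. Understanding these internal mechanisms matters: it lets us detect intent, steer behavior, and possibly improve learning algorithms by guiding models toward more efficient "thinking" processes. Large language models are no longer inscrutable; they increasingly show structured, interpretable logic, offering an unprecedented view into how AI actually "thinks."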


Original text

On the Biology of a Large Language Model (Anthropic, 2025)

LLMs are not the "black box" you were promised.

Mechanistic interpretability — peering into a neural network to reverse engineer its inner workings — has made major strides. Anthropic's On the Biology of a Large Language Model (2025) is a landmark in that effort. What follows is a summary of their progress and some related thoughts.

What is an LLM actually "thinking"?

How can we understand what an LLM is "thinking"? It's clearly very valuable to do so — it could enable steering model behavior, detecting dangerous intent, and more.

But it's much harder than simply observing individual neuron activations, because of superposition: a single neuron participates in many unrelated concepts, and any given concept is smeared across many neurons. You can't just read meaning off one unit. You need to get creative.
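To make superposition concrete, here is a toy sketch with invented numbers: three hypothetical concepts are packed into two neurons by giving each concept its own direction. The concept names, the vectors, and the helper functions are all assumptions for illustration, not anything taken from the paper.

```python
import math

# Three hypothetical concepts stored in only two neurons.
# Each concept is a direction in neuron space, spaced 120 degrees apart.
H = math.sqrt(3) / 2
CONCEPTS = {
    "Texas":    (1.0, 0.0),
    "Olympics": (-0.5, H),
    "rhyme":    (-0.5, -H),
}

def activations(active):
    """Neuron activations when the given concepts are active (a simple sum)."""
    return tuple(sum(CONCEPTS[c][i] for c in active) for i in range(2))

def readout(neurons, concept):
    """Project the neuron vector onto one concept's direction."""
    return sum(n * d for n, d in zip(neurons, CONCEPTS[concept]))

olympics = activations(["Olympics"])
rhyme = activations(["rhyme"])

# Neuron 0 fires identically for both concepts, so it cannot tell them apart.
print(olympics[0] == rhyme[0])

# Projecting onto each concept's direction does separate them.
scores = {c: readout(olympics, c) for c in CONCEPTS}
print({c: round(s, 2) for c, s in scores.items()})
```

Neuron 0 participates in all three concepts, and "Olympics" needs both neurons to be identified, which is why reading a single unit is not enough.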

Circuit tracing

One approach: train a second model to identify discrete concepts, then monitor how those concepts interact over the course of a forward pass.

Anthropic's circuit tracing technique trains a "replacement" model to sparsely recreate the outputs of the base model's MLP layers. This effectively decomposes the base model's activations into a set of sparse features — and it turns out these features correspond to high-level concepts that humans can readily identify, like "Texas" or "the Olympics."
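The shape of such a replacement layer can be sketched as follows. This is a minimal illustration with hand-set weights, assuming a thresholded encoder and a linear decoder; the feature names, directions, threshold, and the `encode`/`decode` helpers are hypothetical, and a real replacement model would learn its weights by training against the base model's MLP outputs.

```python
import math

S = 1 / math.sqrt(3)

# Hypothetical features. Each has an encoder direction (what it reads from the
# layer input) and a decoder direction (what it writes to the layer output).
# All names and numbers are invented; a trained model would learn them.
FEATURES = {
    "Texas":    {"enc": ( S,  S,  S), "dec": ( 1.0,  0.0)},
    "Olympics": {"enc": ( S, -S, -S), "dec": ( 0.0,  1.0)},
    "rhyme":    {"enc": (-S,  S, -S), "dec": (-1.0,  0.0)},
    "addition": {"enc": (-S, -S,  S), "dec": ( 0.0, -1.0)},
}
THRESHOLD = 0.5  # projections at or below this are dropped, which keeps the code sparse

def encode(x):
    """Return only the features whose projection clears the threshold."""
    acts = {}
    for name, feature in FEATURES.items():
        z = sum(xi * ei for xi, ei in zip(x, feature["enc"]))
        if z > THRESHOLD:
            acts[name] = z
    return acts

def decode(acts):
    """Sum each active feature's decoder vector, weighted by its activation."""
    out = [0.0, 0.0]
    for name, a in acts.items():
        for i, d in enumerate(FEATURES[name]["dec"]):
            out[i] += a * d
    return out

x = (S, S, S)          # a layer input pointing along the "Texas" direction
acts = encode(x)
print(acts)            # which features fired
print(decode(acts))    # the replacement layer's output
```

Only one of the four features fires for this input, so the layer's output can be attributed to a single named feature rather than to a dense vector of neuron activations.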

Once you have these human-interpretable features, you can group them into causally-linked clusters by tracing how they interact during the forward pass — building up a wiring diagram of the computation.

Models really do reason in multiple steps

When you run this in practice, you can watch models engage in genuine multi-step reasoning via intermediary concepts. The model will even "think ahead" to future rhyme candidates when planning a poem.

Ask it "what is the capital of the state containing Dallas" and you can observe, in order:

the Dallas feature goes active,
which causes the Texas feature to light up,
which then causes Austin to light up.

It seems fairly clear that this is tracing semantic relationships between high-level concepts — and in doing so, performing a kind of pseudo-symbolic inference, similar to what some philosophers would describe as "higher reasoning."
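One way to picture the Dallas, Texas, Austin chain is as a tiny causal graph in which each feature's activation is computed from its upstream features, and an intervention clamps one feature to zero. The graph, weights, biases, the "capital" feature, and the `run` function below are hypothetical; this is a sketch of the idea of a causal intervention, not the paper's attribution method.

```python
# Hypothetical feature graph, listed in topological order.
# Each downstream feature sums its weighted parents, adds a bias, and applies ReLU.
EDGES = {
    "Texas":  {"Dallas": 1.0},
    "Austin": {"Texas": 1.0, "capital": 1.0},
}
# The negative bias means "Austin" needs both "Texas" and "capital" to fire.
BIAS = {"Texas": 0.0, "Austin": -1.0}

def run(inputs, ablate=()):
    """Propagate activations through the graph, clamping ablated features to 0."""
    acts = {k: (0.0 if k in ablate else v) for k, v in inputs.items()}
    for node, parents in EDGES.items():
        total = BIAS[node] + sum(w * acts.get(p, 0.0) for p, w in parents.items())
        acts[node] = 0.0 if node in ablate else max(0.0, total)
    return acts

prompt = {"Dallas": 1.0, "capital": 1.0}
normal = run(prompt)
ablated = run(prompt, ablate={"Texas"})

print(normal["Austin"])   # Austin fires when the intermediate Texas feature is intact
print(ablated["Austin"])  # Austin stays silent once Texas is clamped to zero
```

If suppressing the intermediate feature changes the answer, that is evidence the model routed its computation through that feature instead of jumping straight from "Dallas" to "Austin."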

This isn't unique to LLMs

This phenomenon doesn't only apply to language models. MCTS-based systems like AlphaZero also converge on concepts that humans recognize.

DeepMind (2022) showed that AlphaZero learned intermediary representations aligning with human chess concepts such as "in check" and "pinning a piece" — entirely on its own, with no human chess knowledge supplied.

Better understanding → better algorithms

Breaking down a model's implicit reasoning can help us design better learning algorithms.

For example: Claude 3.5 Haiku learned an algorithm for small-integer addition that does not cleanly map to human mental math. It splits the problem into multiple parallel pathways — computing a rough magnitude alongside the precise ones-digit — and recombines them, leaning on memorized "lookup table" features.
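A toy sketch of the parallel-pathway idea, not the circuit Anthropic actually found: one pathway produces a low-precision estimate, another looks up the exact ones digit, and a final step recombines them. The rounding-to-5 estimate, the candidate window, and all function names are assumptions chosen so that the recombination is provably exact for non-negative integers.

```python
# Memorized "lookup table" for the ones digit of the sum of two digits.
ONES_LOOKUP = {(a, b): (a + b) % 10 for a in range(10) for b in range(10)}

def nearest5(x):
    """Round a non-negative integer to the nearest multiple of 5."""
    return 5 * ((x + 2) // 5)

def rough_sum(a, b):
    """Low-precision pathway: add the rounded operands (off by at most 4)."""
    return nearest5(a) + nearest5(b)

def ones_digit(a, b):
    """Precise pathway: look up the last digit without computing the full sum."""
    return ONES_LOOKUP[(a % 10, b % 10)]

def add(a, b):
    """Recombine: pick the number near the rough sum that ends in the right digit."""
    estimate = rough_sum(a, b)
    digit = ones_digit(a, b)
    # Nine consecutive integers all end in different digits, so exactly one matches.
    for candidate in range(estimate - 4, estimate + 5):
        if candidate % 10 == digit:
            return candidate

print(add(36, 59))  # 95
```

Neither pathway alone knows the answer: the estimate is too coarse and the lookup only knows the final digit, yet together they pin down the sum without any carrying.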

The natural question follows: can we identify this, then "guide" the model toward a better algorithm?

The model has a "subconscious"

It's worth noting that the model itself does not necessarily have metacognitive insight into the underlying thinking process uncovered by circuit tracing. Ask it to explain how it added two numbers and it will narrate a tidy, human-style procedure — which is not the algorithm it actually ran.

For better or worse, the model has some level of subconscious. And that's precisely what lets us peer in.

Why this matters

Mechanistic interpretability is a fascinating, fast-developing line of work with major Ws on the scoreboard.

Contrary to what your ML professor may have told you a decade ago, in some ways this is now the most insight we've ever extracted from a model. And the implications are significant — for identifying model misbehavior, for steering, and even for designing better learning algorithms.

For the original thread, see the post on X. For the full research, read Anthropic's paper.

Jay Hack