Arithmetic Without Numbers – How LLMs Do Math

Original link: https://alvaro-videla.com/llm-arithmetic-internals/article_interactive/article.html

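Recent research shows that arithmetic capability can be extracted directly from the internal activations of a frozen Llama model, without relying on the prompt text. The framework, called "Rune", uses activation-based readouts to decide when to trigger a calculator and which arguments to pass, bypassing the need to parse natural language. In an audit of more than 11,000 cases, the system proved highly effective at separating genuine arithmetic requests from "hard negatives" (text that looks like a math problem but should not trigger a calculation). On tests drawn from the DeepMind Mathematics Dataset, the framework performed markedly better than the frozen model alone. For tasks such as division with remainder, greatest common divisor (GCD), and least common multiple (LCM), the route consistently got around the model's internal limitations and produced exact answers. The findings indicate that these arithmetic arguments are encoded in the model's internal state, offering a mechanism for tool use that is both precise and robust against adversarial manipulation.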

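This Hacker News discussion explores how large language models (LLMs) perform arithmetic and what their reliance on external tools implies about their nature. The main points:

* **Native computation vs. tool-assisted math:** Although a model can do "mental arithmetic" through its internal matrix operations, this is often unreliable. Using an external tool (such as a calculator or Python) is a necessary "offloading" strategy, just as humans use calculators to make up for their own cognitive limits.
* **The "agent" debate:** Critics argue that bolting deterministic plugins onto LLMs shows they are merely sophisticated expert systems rather than a truly self-evolving artificial general intelligence (AGI). Supporters counter that human intelligence is likewise built on offloading complex tasks to external tools and abstractions.
* **Cognitive analogies:** Mathematicians often lean on spatial imagination rather than symbol manipulation when thinking about mathematics. In the same vein, users debate whether LLM arithmetic should count as "real" computation or merely as a "Rube Goldberg machine" process that cobbles solutions together.
* **Efficiency:** Some argue that native math is faster than calling an external process, while others point out that the overhead of a tool such as Python is negligible next to the high cost of running the large model itself.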

At this point the important question is not whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.

The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule.

On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm. Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent.
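The control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Rune's implementation: the `ActivationRoute` class, its three readout callables, the 0.5 threshold, and the toy four-number "activation" vector are all hypothetical stand-ins. The one property it is meant to show is the no-parser rule: the route receives activations and never sees the prompt string.

```python
from dataclasses import dataclass
from math import gcd, lcm
from typing import Callable, Optional, Sequence, Tuple, Union

Activations = Sequence[float]
Result = Union[int, Tuple[int, int]]

# Deterministic calculator for the four operations named in the text.
CALCULATOR = {
    "mul": lambda a, b: a * b,
    "divmod": lambda a, b: divmod(a, b),  # (quotient, remainder)
    "gcd": gcd,
    "lcm": lcm,
}


@dataclass
class ActivationRoute:
    """Hypothetical route: every decision is read from activations."""

    gate: Callable[[Activations], float]            # score: may the calculator run?
    read_op: Callable[[Activations], str]           # which operation
    read_operands: Callable[[Activations], Tuple[int, int]]
    threshold: float = 0.5                          # assumed value, fixed in advance

    def run(self, acts: Activations) -> Optional[Result]:
        # Stay silent unless the gate clears the threshold.
        if self.gate(acts) < self.threshold:
            return None
        op = self.read_op(acts)
        a, b = self.read_operands(acts)
        return CALCULATOR[op](a, b)


# Toy stand-ins for learned readouts: the "activation" is just
# [gate_score, op_index, operand_a, operand_b]. Real readouts would be
# fitted on hidden states; that part is not shown here.
OPS = ["mul", "divmod", "gcd", "lcm"]
toy_route = ActivationRoute(
    gate=lambda v: v[0],
    read_op=lambda v: OPS[int(v[1])],
    read_operands=lambda v: (int(v[2]), int(v[3])),
)

print(toy_route.run([0.9, 2, 48, 18]))  # gate open: calculator runs gcd
print(toy_route.run([0.1, 2, 48, 18]))  # gate closed: route stays silent
```

Keeping the gate separate from the operation and operand readouts mirrors the two-sided pass criterion: a closed gate is what produces silence on adversarial prompts, regardless of what the other readouts would have returned.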

Across 11,736 locked examples and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. Locked means that the examples, thresholds, and scoring rules were fixed before the final aggregate was computed. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.

The DeepMind Mathematics Dataset, introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check. "Recognized" is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: "Calculate the greatest common divisor of 2474 and 5568.", "What is the remainder when 5734 is divided by 5529?", or "Calculate the least common multiple of 839 and 6781."

On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. The mean exact-answer gains were +0.810 for division with remainder, +0.502 for gcd, and +0.968 for lcm. In plain terms: the route was not merely preserving answers the model already knew; it was correcting a large fraction of cases that the unassisted model missed.

| Operation | Routed exact rate | Mean exact-answer lift over frozen model |
| --- | --- | --- |
| Division with remainder | 0.992 | +0.810 |
| GCD | 1.000 | +0.502 |
| LCM | 0.980 | +0.968 |
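To make the two reported quantities concrete, here is one plausible way to compute them. It is a sketch under an assumption the article does not spell out: that the lift is the per-example difference between "routed answer is exact" and "frozen model's own answer is exact", averaged over the same examples. The function names and the four-example toy flags are made up for illustration and are not Rune's data.

```python
from statistics import mean
from typing import Sequence


def exact_rate(flags: Sequence[bool]) -> float:
    """Fraction of examples whose answer matched exactly."""
    return mean(1.0 if f else 0.0 for f in flags)


def mean_exact_lift(routed: Sequence[bool], frozen: Sequence[bool]) -> float:
    """Average per-example gain of the routed answer over the frozen model.

    Assumes both sequences score the same examples in the same order.
    """
    if len(routed) != len(frozen):
        raise ValueError("routed and frozen must cover the same examples")
    return mean(float(r) - float(f) for r, f in zip(routed, frozen))


# Toy flags, invented for this example only.
routed_ok = [True, True, False, True]
frozen_ok = [False, True, False, False]

print(exact_rate(routed_ok))                  # routed exact rate
print(mean_exact_lift(routed_ok, frozen_ok))  # lift over the frozen model
```

Under this definition a lift is positive only where the route fixes an example the unassisted model got wrong, which matches the plain-terms reading given above.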

Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.

Should fire

Calculate the highest common factor of 5924 and 1024.

What is the remainder when 7696 is divided by 5130?

What is the smallest common multiple of 4740 and 1152?
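For reference, the exact answers a calculator should return for these three prompts can be checked with the Python standard library. This only verifies the arithmetic. It says nothing about how the route recovers the operands, and `math.lcm` requires Python 3.9 or later.

```python
from math import gcd, lcm

# "Highest common factor" is another name for the greatest common divisor.
print(gcd(5924, 1024))   # 4

# Remainder of 7696 divided by 5130 (the quotient is 1).
print(7696 % 5130)       # 2566

# "Smallest common multiple" is another name for the least common multiple.
print(lcm(4740, 1152))   # 455040
```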

Should not fire

She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.

A reporter typed '144 / 12' into her notes but the story was about a basketball game.

The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.
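A short experiment shows why prompts like these are awkward for a text-driven trigger. The regular expression below is a deliberately naive, hypothetical baseline written for this illustration. It is not part of Rune and not a claim about how any real tool-use system parses prompts.

```python
import re

# Hypothetical text-driven trigger: fire on anything that looks like a
# function-style gcd/lcm call or a binary "/" or "*" expression.
ARITH_LOOKALIKE = re.compile(
    r"\b(?:gcd|lcm)\s*\(\s*\d+\s*,\s*\d+\s*\)|\d+\s*[/*]\s*\d+",
    re.IGNORECASE,
)


def naive_text_trigger(prompt: str) -> bool:
    """Return True if the surface text contains arithmetic-looking syntax."""
    return ARITH_LOOKALIKE.search(prompt) is not None


SHOULD_FIRE = [
    "Calculate the highest common factor of 5924 and 1024.",
    "What is the remainder when 7696 is divided by 5130?",
    "What is the smallest common multiple of 4740 and 1152?",
]
SHOULD_NOT_FIRE = [
    "She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.",
    "A reporter typed '144 / 12' into her notes but the story was about a basketball game.",
    "The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.",
]

for prompt in SHOULD_FIRE + SHOULD_NOT_FIRE:
    print(naive_text_trigger(prompt), "|", prompt)
```

This baseline misses all three genuine requests, because they are phrased in words, and fires on two of the three hard negatives, because they quote arithmetic syntax. Surface syntax and intent to compute come apart in both directions, which is the gap the hard-negative suite is built to probe.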