Lean 证明了这个程序的正确性；然后我发现了一个错误。

Lean 证明了这个程序的正确性；然后我发现了一个错误。
Lean proved this program correct; then I found a bug

原始链接: https://kirancodes.me/posts/log-who-watches-the-watchers.html

人工智能代理的最新进展正在大幅降低发现软件漏洞的成本，可能引发一场广泛的“软件危机”，因为大多数现有代码并非为如此密集的审查而设计。这导致人们对形式化验证的兴趣增加——使用机械工具证明代码的正确性。最近在Lean生态系统内取得了一项重要里程碑：10个AI代理自主构建并*验证*了一个完整的zlib实现，名为“lean-zip”，保证它没有实现错误。然而，随后的模糊测试（使用另一个AI代理Claude）揭示了一个令人惊讶的结果。虽然lean-zip本身在超过1.05亿次测试后被证明是安全的，但发现了一个缓冲区溢出——并非在经过验证的代码中——而是在Lean 4运行时中，影响所有版本。此外，lean-zip的未经验证的归档解析器中发现了一个拒绝服务漏洞。这表明，即使是经过形式化验证的代码也依赖于可能仍然存在漏洞的基础系统，并且验证工作必须超出核心应用程序的范围。

一个黑客新闻的讨论强调了使用 Lean 和 Rocq 等工具进行形式化验证的挑战。虽然这些工具可以*证明*代码的正确性，但 Lean 验证了一个程序后发现的错误表明了一个关键点：**证明验证代码与*规范*一致，不一定与预期行为一致。** 评论者和原始帖子作者都经历了这种情况——经过验证的代码包含了一种拒绝服务漏洞，该漏洞源于程序规范中的缺陷。这说明，如果最初的规范不能准确反映期望的结果，完美的证明并不能保证程序没有错误。令人担忧的是，如果规范没有经过严格审查，依赖于证明助手来验证人工智能的正确性可能会产生虚假的安全性，从而导致经过验证的代码*技术上*符合规范，但却不能实现人类*想要*的功能。

原文

I fuzzed a verified implementation of zlib and found a buffer overflow in the Lean runtime.

AI agents are getting very good at finding vulnerabilities in large-scale software systems.

Anthropic, was apparently so spooked by the vulnerability-discovery capabilities of Mythos, they decided not to release it as it was "too dangerous" (lol). Whether you believe the hype about these latest models or not, it seems undeniable that the writing is on the wall:

The cost of discovering security bugs is collapsing, and the vast majority of software running today was never built to withstand that kind of scrutiny. We are facing a looming software crisis.

In the face of this oncoming tsunami, recently there has been increasing interest in formal verification as a solution. If we state and prove properties about our code using a mechanical tool, can we build robust, secure and stable software that can overcome this oncoming barrage of attacks?

One recent development in the Lean ecosystem has taken steps towards this question. 10 agents autonomously built and proved an implementation of zlib, lean-zip, an impressive landmark result. Quoting from Leo De Moura, the chief architect of the Lean FRO (here):

With apologies for the AI-slop (Leo has a penchant for it, it seems), the key result is that lean-zip is not just another implementation of zlib. It is an implementation that has been verified as correct end to end, guaranteed by Lean to be entirely free of implementation bugs.

What does "verified as correct" actually look like? Here is one of the main theorems (github):

theorem zlib_decompressSingle_compress (data : ByteArray) (level : UInt8)
   (hsize : data.size < 1024 * 1024 * 1024) :
    ZlibDecode.decompressSingle
      (ZlibEncode.compress data level) = .ok data

For any byte array less than 1 gigabyte, calling ZlibDecode.decompressSingle on the output of ZlibEncode.compress produces the original data. The decompress function is exactly the inverse of compression. This pair of functions is entirely correct.

Or is it?

I pointed a Claude agent at lean-zip over a weekend, armed with AFL++, AddressSanitizer, Valgrind, and UBSan. Over 105 million fuzzing executions, it found:

Zero memory vulnerabilities in the verified Lean application code.
A heap buffer overflow in the Lean 4 runtime (lean_alloc_sarray), affecting every version of Lean to date. (bug report, pending fix)
A denial-of-service in lean-zip's archive parser, which was never verified.

Lean 证明了这个程序的正确性；然后我发现了一个错误。 Lean proved this program correct; then I found a bug

Lean 证明了这个程序的正确性；然后我发现了一个错误。
Lean proved this program correct; then I found a bug