Three ways formally verified code can go wrong in practice

Original link: https://buttondown.com/hillelwayne/archive/three-ways-formally-verified-code-can-go-wrong-in/

## New Logic for Programmers: v0.12 Release, and What "Correct" Means

The latest release of Logic for Programmers (v0.12) ties into the subtle meaning of "correctness" in software, prompted by recently discovered bugs in formally verified code, including the famously "provably correct" Leftpad. The core issue is that "correct" in formal methods is not the same as the everyday notion of "bug-free": formal verification confirms that code conforms to a particular specification, and that specification may not capture every desired behavior. Bugs can come from an invalid proof, from proving the wrong property (e.g., length rather than visual alignment), or from incorrect or unstated assumptions about the environment or about dependent code. Crucially, verification rests on assumptions (about memory, the compiler, external services, even the absence of integer overflow) that can change after deployment, so "provable correctness" is not a one-time guarantee. Understanding these limitations is essential for using formal methods effectively and for communicating their scope to a team. The author's new book, Logic for Programmers, is now in early access and explores these ideas in more depth.

Hacker News discussion (7 points, by todsacerdoti, 2 comments):

ip26: Isn't asserting your assumptions during code execution standard practice for formally verified code?

jonathanstrange: No consideration of hardware faults? No cosmic rays flipping bits? No discussion of soft or hard real-time guarantees? What about nondeterministic operations that can fail, such as dynamically requesting memory from some operating systems? I ask because I thought high-integrity systems are usually evaluated and certified as a combination of hardware and software. Considering only the software seems useless.

Original article
October 10, 2025

"Correct" doesn't mean "correct" when correctly using "correct"

New Logic for Programmers Release!

v0.12 is now available! This should be the last major content release. The next few months are going to be technical review, copyediting and polishing, with a hopeful 1.0 release in March. Full release notes here.

I run this small project called Let's Prove Leftpad, where people submit formally verified proofs of the eponymous meme. Recently I read Breaking “provably correct” Leftpad, which argued that most (if not all) of the provably correct leftpads have bugs! The Lean proof, for example, should render leftpad('-', 9, אֳֽ֑) as ---------אֳֽ֑, but actually does ------אֳֽ֑.

You can read the article for a good explanation of why this goes wrong (Unicode). The actual problem is that correct can mean two different things, and this leads to confusion about how much formal methods can actually guarantee us. So I see this as a great opportunity to talk about the nature of proof, correctness, and how "correct" code can still have bugs.

What we talk about when we talk about correctness

In most of the real world, correct means "no bugs". Except "bugs" isn't a very clear category. A bug is anything that causes someone to say "this isn't working right, there's a bug." Being too slow is a bug, a typo is a bug, etc. "correct" is a little fuzzy.

In formal methods, "correct" has a very specific and precise meaning: the code conforms to a specification (or "spec"). The spec is a higher-level description of what the code's properties are supposed to be, usually something we can't just directly implement. Let's look at the most popular kind of proven specification:

-- Haskell
inc :: Int -> Int
inc x = x + 1

The type signature Int -> Int is a specification! It corresponds to the logical statement all x in Int: inc(x) in Int. The Haskell type checker can automatically verify this for us. It cannot, however, verify properties like all x in Int: inc(x) > x. Formal verification is concerned with verifying arbitrary properties beyond what is (easily) automatically verifiable. Most often, this takes the form of proof. A human manually writes a proof that the code conforms to its specification, and the prover checks that the proof is correct.
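For a sense of what stating a richer property can look like, here is the inc(x) > x property written as a Liquid Haskell refinement type. This is a sketch of mine, not from the article; note that Liquid Haskell's default logic models Int as an unbounded mathematical integer, which is itself one of the assumptions we'll come back to.

-- Liquid Haskell (sketch): the refinement on the result encodes inc(x) > x,
-- a property the plain Haskell type checker cannot express.
{-@ inc :: x:Int -> {v:Int | v > x} @-}
inc :: Int -> Int
inc x = x + 1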

Even if we have a proof of "correctness", though, there are a few different ways the code can still have bugs.

1. The proof is invalid

For some reason the proof doesn't actually show the code matches the specification. This is pretty common in pencil-and-paper verification, where the proof is checked by someone saying "yep looks good to me". It's much rarer when doing formal verification but it can still happen in a couple of specific cases:

  1. The theorem prover itself has a bug (in the code or introduced in the compiled binary) that makes it accept an incorrect proof. This is something people are really concerned about, but it's far rarer than every other way verified code goes wrong, so it's only included here for completeness.

  2. For convenience, most provers and FM languages have a "just accept that this statement is true" feature. This helps you work on the big-picture proof and fill in the details later. If you leave a shortcut in, and the compiler is configured to allow code with proof assumptions to compile, then you can compile incorrect code that "passes the proof checker". You really should know better, though.
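Lean 4's sorry is the classic example of this feature. A minimal sketch (mine, not from any submitted proof):

-- Lean 4: `sorry` discharges any goal without an actual proof.
-- The file still compiles, but with a "declaration uses 'sorry'" warning.
theorem inc_gt (x : Nat) : x + 1 > x := by
  sorry  -- placeholder: nothing has actually been proven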

2. The properties are wrong


This code is provably correct:

inc :: Int -> Int
inc x = x-1

The only specification I've given is the type signature Int -> Int. At no point did I put the property inc(x) > x in my specification, so it doesn't matter that it doesn't hold; the code is still "correct".

This is what "went wrong" with the leftpad proofs. They do not prove the property "leftpad(c, n, s) will take up either n spaces on the screen or however many characters s takes up (if more than n)". They prove the weaker property "len(leftpad(c, n, s)) == max(n, len(s)), for however you want to define len(string)". The second is a rough proxy for the first that works in most cases, but if someone really needs the former property they are liable to experience a bug.
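To make the proxy concrete, here's a sketch of leftpad over Data.Text (my own illustration, not one of the verified submissions). The proven-style length property holds for it, yet code-point length says nothing about display width:

-- Haskell (sketch): T.length counts Unicode code points, not display columns.
import qualified Data.Text as T

leftpad :: Char -> Int -> T.Text -> T.Text
leftpad c n s = T.replicate (max 0 (n - T.length s)) (T.singleton c) <> s

-- Holds: T.length (leftpad c n s) == max n (T.length s).
-- Yet a string of combining marks can still render narrower than n columns.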

Why don't we prove the stronger property? Sometimes it's because the code is meant to be used one way and people want to use it another way. This can lead to accusations that the developer is "misusing the provably correct code", but this should more often be seen as the verification expert failing to educate devs on what was actually "proven".

Sometimes it's because the property is too hard to prove. "Outputs are visually aligned" is a proof about Unicode inputs, and the core Unicode specification is 1,243 pages long.

Sometimes it's because the property we want is too hard to express. How do you mathematically represent "people will perceive the output as being visually aligned"? Is it OS and font dependent? These two lines are exactly five characters but not visually aligned:

|||||

MMMMM

Or maybe they are aligned for you! I don't know, lots of people read email in a monospace font. "We can't express the property" comes up a lot when dealing with human/business concepts as opposed to mathematical/computational ones.

Finally, there's just the possibility of a brain fart. All of the proofs in Nearly All Binary Searches and Mergesorts are Broken are like this. They (informally) proved the correctness of binary search with unbounded integers, forgetting that many programming languages use machine integers, where a large enough sum can overflow.
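The culprit there is the midpoint calculation. A sketch of the pitfall (Haskell's Int is a fixed-width machine integer, so it applies here just as in Java or C):

-- Haskell
midBuggy, midSafe :: Int -> Int -> Int
midBuggy lo hi = (lo + hi) `div` 2       -- lo + hi can exceed maxBound and wrap
midSafe  lo hi = lo + (hi - lo) `div` 2  -- stays in range whenever 0 <= lo <= hi

-- midBuggy (maxBound - 1) maxBound wraps to a negative "index";
-- midSafe  (maxBound - 1) maxBound == maxBound - 1.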

3. The assumptions are wrong

This is arguably the most important and most subtle source of bugs. Most properties we prove aren't "X is always true". They are "assuming Y is true, X is also true". Then if Y is not true, the proof no longer guarantees X. A good example of this is binary search, which only correctly finds elements assuming the input list is sorted. If the list is not sorted, it will not work correctly.

Formal verification adds two more wrinkles. One: sometimes we need assumptions to make the property valid, but we can also add them to make the proof easier. So the code can be bug-free even if the assumptions used to verify it no longer hold! Even if a leftpad implements visual alignment for all Unicode glyphs, it will be a lot easier to prove visual alignment for just ASCII strings and padding.

Two: we need to make a lot of environmental assumptions that are outside our control. Does the algorithm return output or use the stack? Need to assume that there's sufficient memory to store stuff. Does it use any variables? Need to assume nothing is concurrently modifying them. Does it use an external service? Need to assume the vendor doesn't change the API or response formats. You need to assume the compiler worked correctly, the hardware isn't faulty, the OS doesn't mess with things, etc. Any of these could change well after the code is proven and deployed, meaning formal verification can't be a one-and-done thing.

You don't actually have to assume most of these, but each assumption you drop makes the proof harder and the properties you can prove more restricted. Remember, the code might still be bug-free even if the environmental assumptions change, so there's a tradeoff in time spent proving vs doing other useful work.

Another common source of "assumptions" is when verified code depends on unverified code. The Rust compiler can prove that safe code doesn't have a memory bug assuming unsafe code does not have one either, but it depends on the human to confirm that assumption. Liquid Haskell is verifiable but can also call regular Haskell libraries, which are unverified. We need to assume that code is correct (in the "conforms to spec" sense), and if it's not, our proof can be "correct" and still cause bugs.
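As a sketch of how that trust boundary can be written down, Liquid Haskell lets you assume a spec for an unverified function; everything proven afterwards silently inherits it. (sortFast and its spec below are hypothetical, purely for illustration.)

-- Liquid Haskell (sketch): an assumed spec for an unverified function.
-- If sortFast ever returns a list of a different length, proofs built
-- on this assumption are void, and the checker cannot tell.
{-@ assume sortFast :: Ord a => xs:[a] -> {v:[a] | len v == len xs} @-}
sortFast :: Ord a => [a] -> [a]
sortFast = undefined  -- stands in for an imported, unverified implementation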


These boundaries are fuzzy. I wrote that the "binary search" bug happened because they proved the wrong property, but you can just as well argue that it was a broken assumption (that integers could not overflow). What really matters is having a clear understanding of what "this code is proven correct" actually tells you. Where can you use it safely? When should you worry? How do you communicate all of this to your teammates?

Good lord it's already Friday

If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.

My new book, Logic for Programmers, is now in early access! Get it here.
