Congratulations on the verification of your proof! It must be great to have the crowning work of your research life formally confirmed! It's also a great victory for Quine's New Foundations.
---
I would bet that for this use case LLMs would be worse than a random walk, since they would probably disregard certain possibilities at crucial steps, unless we are talking about easier material.
---
I was just editing the Wikipedia article; it should be more readable now: https://en.wikipedia.org/wiki/New_Foundations

I think the main attraction is the existence of the universal set. For my use case, the type system of a programming language, such a universal set is very useful. Existing systems rely on various hacks, like cumulative universes or type-in-type, which are not as satisfying. Instead, I can just check that a type signature is stratified and then forget about types having numerical levels.
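To illustrate the kind of check this enables, here is a hypothetical sketch of my own (not code from any actual system): stratification just asks for an assignment of integer levels to variables that is consistent with every membership and equality atom, which can be decided by propagating difference constraints.

```python
from collections import defaultdict, deque

# A minimal stratification check: assign an integer "type level" to each
# variable so that every membership atom  x ∈ y  gives
# level(y) == level(x) + 1 and every equality atom  x = y  gives
# level(x) == level(y). The names and encoding here are illustrative.

def stratify(atoms):
    """atoms: list of ('in', x, y) or ('eq', x, y).
    Returns a dict of levels if the formula is stratified, else None."""
    graph = defaultdict(list)
    for kind, x, y in atoms:
        d = 1 if kind == 'in' else 0
        graph[x].append((y, d))   # level(y) = level(x) + d
        graph[y].append((x, -d))  # level(x) = level(y) - d
    levels = {}
    for start in list(graph):
        if start in levels:
            continue
        levels[start] = 0         # each connected component is anchored at 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, d in graph[u]:
                if v not in levels:
                    levels[v] = levels[u] + d
                    queue.append(v)
                elif levels[v] != levels[u] + d:
                    return None   # conflicting levels: not stratified
    return levels

# x ∈ y, y ∈ z is stratified; x ∈ x is not.
print(stratify([('in', 'x', 'y'), ('in', 'y', 'z')]))  # {'x': 0, 'y': 1, 'z': 2}
print(stratify([('in', 'x', 'x')]))                    # None
```

The levels themselves are thrown away after the check, which matches the point above: you verify stratifiability and then never mention numerical levels again.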
---
I agree, a "cool" thing about NF is the universal set.

Another way to be introduced to New Foundations, along with how one can use it, is the Metamath database for New Foundations: https://us.metamath.org/nfeuni/mmnf.html

Metamath is a proof system, but unlike most alternative systems like Lean, it doesn't have a built-in set of axioms: you declare your axioms, and then you can prove theorems (using only previous axioms and proven theorems). So you can declare the axioms of New Foundations and then use them to prove other things.

One thing you immediately discover when you try to use New Foundations is that "the usual definition of the ordered pair, first proposed by Kuratowski in 1921 and used in the regular Metamath Proof Explorer, has a serious drawback for NF and related theories that use stratification. The Kuratowski ordered pair is defined as << x , y >> = { { x } , { x , y } }. This leads to the ordered pair having a type two greater than its arguments. For example, z in << << x , y >> , z >> would have a different type than x and y, which makes multi-argument functions extremely awkward to work with. Operations such as "1st" and complex "+" would not form sets in NF with the Kuratowski ordered pairs. In contrast, the Quine definition of ordered pairs, defined in definition df-op, is type level. That is, <. x , y >. has the same type as x and y, which means that the same holds of <. <. x , y >. , z >. This means that "1st" is a set with the Quine definition, as is complex "+". The Kuratowski ordered pair is defined (as df-opk), because it is a simple definition that can be used by the set construction axioms that follow it, but for typical uses the Quine definition of ordered pairs df-op is used instead."

One eye-popping result is that the Axiom of Choice can be disproven in NF. See that site (or other pages about NF) for details.
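To spell out the type arithmetic in the quoted passage (in my own notation): if $x$ and $y$ both sit at level $n$ in a stratification, then

```latex
\{x\} : n+1, \qquad \{x,y\} : n+1, \qquad
\langle x, y\rangle_{K} = \{\{x\},\{x,y\}\} : n+2
```

so a stratified formula must place the Kuratowski pair two levels above its components, while a type-level (Quine) pair stays at level $n$, which is why nested pairs and operations like "1st" remain well-behaved there.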
---
Your comment doesn't make any sense. Also:

> A language that forbids to talk about its own grammatical structures

That's just not the case, and it's not what is being claimed here.
---
You just move the sorts around, you don't get rid of having to have sorts. I guess a rug with dirt swept under it is more aesthetically pleasing than a rug with dirt on top of it?
---
Correct. The good news is that the Elements still works otherwise; you just need to add the missing axiom. But many other "proofs" have been found to be false.

The book "Metamath: A Computer Language for Mathematical Proofs" (by Norm Megill and yours truly) is available at https://us.metamath.org/downloads/metamath.pdf - see section 1.2.2, "Trusting the Mathematician". We list just a few of the many examples of proofs that weren't.

Sure, there can be bugs in programs, but there are ways to counter such bugs that give FAR more confidence than can be afforded to humans. Lean's approach is to have a small kernel, and then review the kernel. Metamath is even more serious: the Metamath approach is to have an extremely small language, and then re-implement it many times (so that a bug is unlikely to be reproduced in all implementations).

The most popular Metamath database is "set.mm", which uses classical logic and ZFC. Every change is verified by 5 different verifiers in 5 different programming languages, originally developed by 5 different people:

* metamath.exe aka Cmetamath (the original C verifier by Norman Megill)
* checkmm (a C++ verifier by Eric Schmidt)
* smetamath-rs (smm3) (a Rust verifier by Stefan O'Rear)
* mmj2 (a Java verifier by Mel L. O'Cat and Mario Carneiro)
* mmverify.py (a Python verifier by Raph Levien)

For more on these verifiers, see: https://github.com/metamath/set.mm/blob/develop/verifiers.md
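To give a feel for how small the Metamath language is, here is a complete toy database (my own minimal example, not from set.mm): it declares two constants, states one axiom, and proves one theorem by citing that axiom. Any conforming verifier must accept it.

```metamath
$( A toy database: declare symbols, state an axiom, prove a theorem. $)
$c |- A $.                  $( the constants "|-" and "A" $)
ax-triv $a |- A $.          $( axiom: A is provable $)
thm-triv $p |- A $= ax-triv $.  $( theorem: proved by citing the axiom $)
```

Because the whole language is only a handful of statement types like these, writing an independent verifier is a weekend project, which is what makes the five-verifier cross-check above practical.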
---
But like, you can look at what parts of Mathlib this development imports; it's mainly stuff imported by files in this subdirectory https://github.com/leanprover-community/con-nf/tree/main/Con... , and it's pretty basic: the definition of a permutation, a cardinal number, etc. Almost all of these are things that would feature in the first one or two years of an undergraduate math degree (from quickly scanning it, the most advanced thing I could see is the definition of cofinality of ordinals). It seems practically impossible to me that someone would make a mistake when, e.g., defining what a group is, in a way subtle enough to later break this advanced theorem. If you think people could mess that up, then all of math would be in doubt.
---
I very much like this. I wonder whether this will eventually lead to collaborative proofs and 'bug fixes', essentially turning maths into a process similar to code on GitHub.
---
I wish I had the free time to keep up with the mathlib project - this is so cool. Is there any way someone can get involved in a super hands-off way?
---
Well, thanks a lot! One minute I'm setting up a monitoring system and then ... I've just proved two is the number after the number after zero \o/ 8)
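For what it's worth, that statement really is a one-liner; a sketch in Lean 4 syntax (assuming only core `Nat`, no Mathlib):

```lean
-- "Two is the number after the number after zero":
-- the literal 2 is definitionally Nat.succ (Nat.succ Nat.zero),
-- so reflexivity closes the goal.
example : 2 = Nat.succ (Nat.succ Nat.zero) := rfl
```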
---
I'm really not from the field, but wasn't there some Gödel theorem showing that any sufficiently strong system cannot prove its own consistency?
---
Gödel's incompleteness theorems are what you're thinking of: https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...

That said, if you have a system X, it can't prove its own consistency, but a stronger system Y can prove its consistency (and perhaps some other, stronger system can prove Y's consistency; this gives us a chain of systems, each proving the consistency of some weaker system). That doesn't absolutely prove that the system is consistent: if Y were inconsistent, it could prove X is consistent (and could also prove that X is inconsistent). Nonetheless, it is still valuable. After all, part of why we use Y is the fact that we know of no inconsistency in it. And since formal systems are often subtly inconsistent, "consistent assuming some other system is consistent" is a lot better than "we have no proof whatsoever of consistency".
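In symbols (my shorthand, writing $\mathrm{Con}(X)$ for "X is consistent"), the second incompleteness theorem and the chain described above look like:

```latex
X \nvdash \mathrm{Con}(X), \qquad
Y \vdash \mathrm{Con}(X), \qquad
Z \vdash \mathrm{Con}(Y), \quad \dots
```

so each consistency claim along the chain is only relative: trusting $\mathrm{Con}(X)$ via this route ultimately means trusting the stronger system that proves it.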
---
These are weaker than the systems Gödel's theorems refer to, as discussed in the opening paragraph. So these systems are not "strong enough" in the sense described in this thread.
---
You should take this proof as saying: if Lean 4 is consistent, then New Foundations is consistent. There is no contradiction with Gödel's incompleteness theorems.
---
I don't think the system here is trying to prove any of its underlying assumptions, just building on some set of existing ones. I doubt the theorem you are thinking of is applicable.
---
There is a serious danger, which has nothing to do with bugs in Lean, which is a known problem for software verification and also applies in math: one must read the conclusions carefully to make sure that the right thing is actually proved.
I read Wilshaw's final conclusions carefully, and she did indeed prove what needed to be proved.