数学提炼挑战 – 方程理论
Mathematics Distillation Challenge – Equational Theories

原始链接: https://terrytao.wordpress.com/2026/03/13/mathematics-distillation-challenge-equational-theories/

最近的数学研究正在探索一种协作方法,超越传统的个人工作,利用社区输入和人工智能。Polymath项目和协作形式化努力展示了这种转变,最终促成了2024年等式理论项目(ETP)。ETP成功地利用人工智能和形式化验证解决了普遍代数中超过2200万个真假问题。 现在,SAIR基金会发起一项新的竞赛挑战,旨在将ETP的结果“提炼”成一份简洁的“作弊单”——一套小的指导原则,以提高较小、功能较弱的人工智能模型的性能。目前,这些廉价模型的表现处于随机水平,但研究人员相信,设计良好的作弊单可以提高它们的准确性。 这项挑战以一个公开测试“游乐场”启动,邀请参与者创建这些作弊单(小于10KB),并将通过多个阶段进行,最终不仅需要答案,还需要证明或反例。目标是发现适用于更广泛数学挑战的有效问题解决技术,并可能在其他数学领域复制这种提炼过程。

黑客新闻 新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 数学提炼挑战 – 方程理论 (terrytao.wordpress.com) 5 分,由 picafrost 1小时前发布 | 隐藏 | 过去 | 收藏 | 讨论 帮助 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系 搜索:
相关文章

原文

Mathematical research traditionally involves a small number of professional mathematicians working closely on difficult problems. However, I have long believed that there is a complementary way to do mathematics, in which one works with a broad community of mathematically minded people on problems which may not be as deep as the problems one traditionally works on, but still are of mathematical interest; and that modern technologies, including AI, are more suitable for contributing the latter type of workflow. The “Polymath projects” were one example of this broad type of collaboration, where internet platforms such as blogs and wikis were used to facilitate such collaboration. Some years later, collaborative formalization projects (such as the one to formalize the Polynomial Freiman–Ruzsa conjecture of Marton, discussed previously on this blog here) became popular in some circles. And in 2024, I launched the Equational Theories Project (ETP) (discussed on this blog here and here), combining the rigor of Lean formalization with “good old fashioned AI” (in the form of automated theorem provers) to settle (with formal verification) over 22 million true-false problems in universal algebra.

Continuing in this spirit, Damek Davis and I are launching a new project, in the form of an experimental competitive challenge hosted by the SAIR Foundation (where I serve as a board member, and which is supplying technical support and compute). The idea of this challenge, motivated in part by this recent paper of Honda, Murakami, and Zhang, is to measure the extent to which the 22 million universal algebra true-false results obtained by the ETP can be “distilled” into a short, human-readable “cheat sheet”, similar to how a student in an undergraduate math class might distill the knowledge learned from that class into a single sheet of paper that the student is permitted to bring into an exam.

Here is a typical problem in universal algebra that the ETP was able to answer:

Problem 1 Suppose that {*: M \times M \rightarrow M} is a binary operation such that {x * (y * z) = (z * w) * w} for all {x,y,z,w}. Is it true that {x * (y * x) = (x * y) * z} for all {x,y,z}?

Such a problem can be settled either by algebraically manipulating the initial equation to deduce the target equation, or by finding a counterexample to the target equation that still satisfies the initial equation. There are a variety of techniques to achieve either of these, but this sort of problem is difficult, and even undecidable in some cases; see this paper of the ETP collaborators for more discussion. Nevertheless, many of these problems can be settled with some effort by humans, by automated theorem provers, or by frontier AI systems; here for instance is an AI-generated solution to the above problem.

However, these AI models are expensive, and do not reveal much insight as to where their answers come from. If one instead tries a smaller and cheaper model, such as one of the many open-source models available, it turns out that these models basically perform no better than random chance, in that when asked to say whether the answer to a question such as the above is true or false, they only answer correctly about 50% of the time.

But, similarly to how a student struggling with the material for a math class can perform better on an exam when provided the right guidance, it turns out that such cheap models can perform at least modestly better on this task (with success rates increasing to about 55%-60%) if given the right prompt or “cheat sheet”.

“Stage 1” of the distillation challenge, which we launched today, asks for contestants to design a cheat sheet (of at most 10 kilobytes in size) that can increase the performance of these models on the above true-false problems to as high a level as possible. We have provided a “playground” with which to test one’s cheat sheet (or a small number of example cheat sheets) some cheap models against a public set of 1200 problems (1000 of which were randomly selected, and rather easy, together with 200 “hard” problems that were selected to resist the more obvious strategies for resolving these questions); a brief video explaining how to use the playground can be found here.

Submissions stage will end on April 20, after which we will evaluate the submissions against a private subset of test questions. The top 1000 submissions will advance to a second stage which we are currently in the process of designing, which will involve more advanced models, but also the more difficult task of not just providing a true-false answer, but also a proof or counterexample to the problem.

The competition will be coordinated on this Zulip channel, where I hope there will be a lively and informative discussion.

My hope is that the winning submissions will capture the most productive techniques for solving these problems, and/or provide general problem-solving techniques that would also be applicable to other types of mathematical problems. We started with the equational theory project data set for this pilot competition due to its availability and spectrum of difficulty levels, but if this type of distillation process leads to interesting results, one could certainly run in on many other types of mathematical problem classes to get some empirical data on how readily they can be solved, particularly after we learn from this pilot competition on how to encourage participation and share of best practices.

SAIR will also launch some other mathematical challenges in the coming months that will be of a more cooperative nature than this particular competitive challenge; stay tuned for further announcements.

联系我们 contact @ memedata.com