DeepSeek-Prover-V2

原始链接: https://github.com/deepseek-ai/DeepSeek-Prover-V2

DeepSeek-Prover-V2 is an open-source large language model designed for formal theorem proving in Lean 4. It uses a novel approach that synthesizes cold-start reasoning data through a recursive proof search powered by DeepSeek-V3. The process decomposes complex problems into subgoals, formalizes them in Lean 4, and combines the formal proofs with DeepSeek-V3's chain-of-thought reasoning, yielding a unified model that integrates informal and formal mathematical reasoning.

The model is then trained with reinforcement learning on the synthetic cold-start data, using binary correct-or-incorrect feedback. DeepSeek-Prover-V2 achieves state-of-the-art performance in neural theorem proving, as demonstrated by its high pass rate on MiniF2F-test and its success on problems from PutnamBench. The team also introduces ProverBench, a benchmark dataset of 325 problems drawn from the AIME competitions and textbook examples, enabling comprehensive evaluation across difficulty levels. DeepSeek-Prover-V2 is released in 7B and 671B parameter sizes and is accessible via Hugging Face's Transformers.

The Hacker News discussion centers on DeepSeek-Prover-V2, a math-focused LLM that excels at breaking complex problems into smaller, more manageable steps. Commenters highlighted its potential for code-based problem solving, likening the approach to techniques used to mentor junior engineers. One user shared the example of a robot that decomposes a task such as taking out the trash into a sequence of extremely detailed actions. The discussion explored the challenge of maintaining context over long-running projects, and the idea that AI agents may need to operate as a team with specialized roles and a coordinator; some saw MoE models and tools like Roo Code as steps in that direction. Commenters also touched on the limitations of current models and the importance of incorporating error feedback (for example, from Lean) to improve performance. Others discussed the model's "open-source" nature, its choice of sizes (671B and 7B), and the potential for future domain-specific "expert" LLMs. Finally, one user demonstrated solving a math problem formalized in Lean via OpenRouter.ai, suggesting the model may outperform the previous SOTA.

Original Text

Model Summary | ProverBench | Model&Dataset Download | Quick Start | License | Contact

We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model.


Synthesize Cold-Start Reasoning Data through Recursive Proof Search

  • To construct the cold-start dataset, we develop a simple yet effective pipeline for recursive theorem proving, utilizing DeepSeek-V3 as a unified tool for both subgoal decomposition and formalization. We prompt DeepSeek-V3 to decompose theorems into high-level proof sketches while simultaneously formalizing these proof steps in Lean 4, resulting in a sequence of subgoals.

  • We use a smaller 7B model to handle the proof search for each subgoal, thereby reducing the associated computational burden. Once the decomposed steps of a challenging problem are resolved, we pair the complete step-by-step formal proof with the corresponding chain-of-thought from DeepSeek-V3 to create cold-start reasoning data.
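As an illustrative sketch (not taken from the repository), a decomposition of this kind might look as follows in Lean 4: each `have` step is a subgoal initially left as `sorry` for the smaller prover to close, and the final line composes the resolved subgoals into a complete proof.

```lean
import Mathlib

/- Hypothetical example of a decomposed proof sketch: each `have` is a
   subgoal handed to the 7B prover, and the closing `exact` composes the
   resolved subgoals. The theorem and its decomposition are illustrative. -/
theorem example_sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry  -- subgoal 1, to be closed by the prover
  have h2 : 0 ≤ b ^ 2 := by sorry  -- subgoal 2, to be closed by the prover
  exact add_nonneg h1 h2           -- compose the resolved subgoals
```

Once the prover replaces each `sorry` with a verified tactic proof (here, `sq_nonneg a` and `sq_nonneg b` would suffice), the composed proof checks end to end.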


Reinforcement Learning with Synthetic Cold-Start Data

  • We curate a subset of challenging problems that remain unsolved by the 7B prover model in an end-to-end manner, but for which all decomposed subgoals have been successfully resolved. By composing the proofs of all subgoals, we construct a complete formal proof for the original problem. This proof is then appended to DeepSeek-V3's chain-of-thought, which outlines the corresponding lemma decomposition, thereby producing a cohesive synthesis of informal reasoning and subsequent formalization.

  • After fine-tuning the prover model on the synthetic cold-start data, we perform a reinforcement learning stage to further enhance its ability to bridge informal reasoning with formal proof construction. Following the standard training objective for reasoning models, we use binary correct-or-incorrect feedback as the primary form of reward supervision.

  • The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. The proofs generated by DeepSeek-Prover-V2 for the miniF2F dataset are available for download as a ZIP archive.
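The binary reward described above can be sketched in a few lines. Here `lean_check` is a hypothetical stand-in for invoking the Lean 4 compiler on a candidate proof; it is not the project's actual verifier interface.

```python
def lean_check(candidate_proof: str) -> bool:
    """Hypothetical stand-in for running the Lean 4 compiler on a candidate
    proof. A real implementation would invoke Lean and inspect the result;
    here we only reject obviously incomplete proofs containing `sorry`."""
    return "sorry" not in candidate_proof

def binary_reward(candidate_proof: str) -> float:
    """Binary correct-or-incorrect reward: 1.0 if the proof verifies, else 0.0."""
    return 1.0 if lean_check(candidate_proof) else 0.0

print(binary_reward("theorem t : 1 + 1 = 2 := by norm_num"))  # 1.0
print(binary_reward("theorem t : 1 + 1 = 2 := by sorry"))     # 0.0
```

The sparse 0/1 signal is what makes the cold-start data essential: without it, the model would rarely produce a verifying proof to be rewarded for.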


3. ProverBench: Formalization of AIME and Textbook Problems

We introduce ProverBench, a benchmark dataset comprising 325 problems. Of these, 15 are formalized from number theory and algebra questions featured in the recent AIME competitions (AIME 24 and 25), offering authentic high-school competition-level challenges. The remaining 310 problems are drawn from curated textbook examples and educational tutorials, contributing a diverse and pedagogically grounded collection of formalized mathematical problems. This benchmark is designed to enable more comprehensive evaluation across both high-school competition problems and undergraduate-level mathematics.

| Area                | Count |
| ------------------- | ----- |
| AIME 24&25          | 15    |
| Number Theory       | 40    |
| Elementary Algebra  | 30    |
| Linear Algebra      | 50    |
| Abstract Algebra    | 40    |
| Calculus            | 90    |
| Real Analysis       | 30    |
| Complex Analysis    | 10    |
| Functional Analysis | 10    |
| Probability         | 10    |
| **Total**           | 325   |

4. Model & Dataset Downloads

We release DeepSeek-Prover-V2 in two model sizes: 7B and 671B parameters. DeepSeek-Prover-V2-671B is trained on top of DeepSeek-V3-Base. DeepSeek-Prover-V2-7B is built upon DeepSeek-Prover-V1.5-Base and features an extended context length of up to 32K tokens.

You can directly use Hugging Face's Transformers for model inference. DeepSeek-Prover-V2-671B shares the same architecture as DeepSeek-V3. For detailed information and supported features, please refer to the DeepSeek-V3 documentation on Hugging Face.

The following is a basic example of generating a proof for a problem from the miniF2F dataset:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)

model_id = "deepseek-ai/DeepSeek-Prover-V2-7B"  # or deepseek-ai/DeepSeek-Prover-V2-671B
tokenizer = AutoTokenizer.from_pretrained(model_id)

formal_statement = """
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- What is the positive difference between $120\%$ of 30 and $130\%$ of 20? Show that it is 10.-/
theorem mathd_algebra_10 : abs ((120 : ℝ) / 100 * 30 - 130 / 100 * 20) = 10 := by
  sorry
""".strip()

prompt = """
Complete the following Lean 4 code:

```lean4
{}
```

Before producing the Lean 4 code to formally prove the given theorem, provide a detailed proof plan outlining the main proof steps and strategies.
The plan should highlight key ideas, intermediate lemmas, and proof structures that will guide the construction of the final formal proof.
""".strip()

chat = [
  {"role": "user", "content": prompt.format(formal_statement)},
]

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

import time  # time the generation for reference
start = time.time()
outputs = model.generate(inputs, max_new_tokens=8192)
print(tokenizer.batch_decode(outputs)[0])
print(f"Generation took {time.time() - start:.1f}s")
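Since the prompt asks the model to emit its final proof inside a `lean4` code fence, a small helper can pull the Lean code out of the decoded output for verification. This is a sketch assuming the model follows the fence convention from the prompt; the exact output format may vary.

```python
import re

def extract_lean_block(generated: str) -> str:
    """Return the last lean4-fenced code block in the model's decoded output,
    or an empty string if no such block is present. The model is prompted to
    place its final proof in such a fence, but this convention is assumed,
    not guaranteed."""
    blocks = re.findall(r"```lean4\n(.*?)```", generated, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else ""

sample = "Proof plan: trivial by reflexivity.\n```lean4\ntheorem t : 1 = 1 := rfl\n```"
print(extract_lean_block(sample))  # theorem t : 1 = 1 := rfl
```

The extracted text can then be written to a `.lean` file and checked with the Lean 4 toolchain.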

The use of DeepSeek-Prover-V2 models is subject to the Model License.

If you have any questions, please raise an issue or contact us at [email protected].
