DeepSeek releases open-weights math model with IMO gold medal performance

Original link: https://huggingface.co/deepseek-ai/DeepSeek-Math-V2

Recent advances in large language models (LLMs) have shown promise in mathematical reasoning, with performance on quantitative competitions improving rapidly. However, merely producing the correct *answer* does not guarantee reliable *reasoning*, which is essential for complex tasks such as theorem proving. This work focuses on developing "self-verifiable" mathematical reasoning in LLMs. The team trained an LLM-based verifier to assess the rigor of proofs, then used that verifier as a reward model to train a proof generator. The generator is incentivized to identify and correct errors in its own work *before* finalizing a solution. To keep improving the verifier, they scaled up verification compute to automatically label challenging proofs for use as training data. The resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving ability, achieving top scores on challenging benchmarks such as IMO 2025 and CMO 2024, and a near-perfect result on Putnam 2024. These results suggest that self-verification is a viable path toward building stronger, more capable mathematical AI systems.
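As a rough illustration of that last step (scaling verification compute to auto-label proofs), the sketch below samples a stochastic verifier several times per proof and keeps only proofs whose verdicts agree strongly enough to serve as automatic labels. This is a hypothetical sketch: the sample count, agreement threshold, and the verify_once interface are assumptions, not details from the model card.

    from typing import Callable, Optional

    def auto_label(
        verify_once: Callable[[str, str], bool],  # one stochastic verifier pass
        problem: str,
        proof: str,
        n_samples: int = 16,       # assumed budget, not from the model card
        agreement: float = 0.9,    # assumed consensus threshold
    ) -> Optional[bool]:
        """Return a consensus label, or None if verdicts disagree too much."""
        votes = [verify_once(problem, proof) for _ in range(n_samples)]
        frac_valid = sum(votes) / n_samples
        if frac_valid >= agreement:
            return True            # confidently labeled "rigorous"
        if frac_valid <= 1 - agreement:
            return False           # confidently labeled "flawed"
        return None                # hard to verify: spend more compute or review

Proofs that come back with a confident consensus label can then be folded into the verifier's training set, which is the self-improvement loop the summary describes.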

From the Hacker News discussion, victorbuilds notes: "Worth noting: they open-sourced the weights under Apache 2.0, unlike OpenAI and DeepMind, whose IMO gold medal models remain proprietary."

Original article

1. Introduction

Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have gone from poor performance to saturating quantitative reasoning competitions such as AIME and HMMT within a year. However, this approach faces fundamental limitations. Pursuing higher final-answer accuracy does not address a key issue: correct answers do not guarantee correct reasoning. Moreover, many mathematical tasks, such as theorem proving, require rigorous step-by-step derivation rather than a numerical answer, making final-answer rewards inapplicable.

To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in its own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose scaling verification compute to automatically label new hard-to-verify proofs, creating training data that further improves the verifier.

Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute. While much work remains, these results suggest that self-verifiable mathematical reasoning is a feasible research direction that may help develop more capable mathematical AI systems.
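To make the training setup concrete, here is a minimal sketch of the generator-verifier loop described above, with the verifier acting as the reward model and the generator revising its own draft before the final proof is scored. All names here (generate, self_refine, verify, Trajectory) are hypothetical scaffolding; this excerpt does not specify the actual RL algorithm or reward design, so the policy update is left abstract.

    from dataclasses import dataclass
    from typing import Callable, List

    Problem = str
    Proof = str

    @dataclass
    class Trajectory:
        problem: Problem
        proof: Proof
        reward: float  # verifier's rigor score for the final proof

    def collect_trajectories(
        generate: Callable[[Problem], Proof],            # proof generator (policy)
        self_refine: Callable[[Problem, Proof], Proof],  # generator fixing its own draft
        verify: Callable[[Problem, Proof], float],       # LLM verifier as reward model
        problems: List[Problem],
    ) -> List[Trajectory]:
        out: List[Trajectory] = []
        for p in problems:
            draft = generate(p)
            # Incentive structure: the generator resolves issues in its own
            # draft before finalizing; only the final proof earns the reward.
            final = self_refine(p, draft)
            out.append(Trajectory(p, final, verify(p, final)))
        return out

A policy-gradient update could then consume these trajectories, increasing the probability of proofs the verifier scores as rigorous; which specific algorithm DeepSeek used is not stated in this excerpt.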

2. Evaluation Results

Below are evaluation results on IMO-ProofBench (developed by the DeepMind team behind DeepThink IMO-Gold) and recent mathematics competitions including IMO 2025, CMO 2024, and Putnam 2024.

IMO-ProofBench


Mathematics Competitions

4. Quick Start

DeepSeekMath-V2 is built on top of DeepSeek-V3.2-Exp-Base. For inference support, please refer to the DeepSeek-V3.2-Exp GitHub repository.
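The model card defers to the DeepSeek-V3.2-Exp repository for the supported inference path. Purely as a hedged sketch, loading the checkpoint through the standard Hugging Face transformers causal-LM interface would look like the following; the prompt, dtype, and trust_remote_code usage are assumptions, and the architecture may require the custom code from the linked repository rather than this generic path.

    # Minimal sketch, assuming standard transformers causal-LM support.
    # See the DeepSeek-V3.2-Exp repository for the supported inference path.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-Math-V2"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # large model: expect multi-GPU sharding
        device_map="auto",
        trust_remote_code=True,
    )

    prompt = "Prove that the sum of two odd integers is even."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))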

6. License

This repository and the model weights are licensed under the Apache License, Version 2.0 (Apache 2.0).

7. Citation

@misc{deepseek-math-v2,
  author = {Zhihong Shao and Yuxiang Luo and Chengda Lu and Z.Z. Ren and Jiewen Hu and Tian Ye and Zhibin Gou and Shirong Ma and Xiaokang Zhang},
  title = {DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning},
  year = {2025},
}

8. Contact

If you have any questions, please raise an issue or contact us at [email protected].
