DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
- URL: http://arxiv.org/abs/2511.22570v1
- Date: Thu, 27 Nov 2025 16:01:22 GMT
- Title: DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
- Authors: Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang,
- Abstract summary: Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Our model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
- Score: 26.142347272743496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.
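The abstract describes a generate–verify–revise loop: a verifier scores proofs, and the generator is incentivized to resolve every issue the verifier flags before finalizing. The sketch below is a minimal toy illustration of that control flow only; `verify`, `revise`, and the dictionary-based proof representation are hypothetical stand-ins, not the paper's actual LLM-based components.

```python
def verify(proof):
    """Toy verifier: return the list of steps it flags as flawed
    (a hypothetical stand-in for the LLM-based proof verifier)."""
    return [step for step in proof if step.get("flawed")]

def revise(proof, issues):
    """Toy generator self-revision: repair exactly the flagged steps."""
    repaired = []
    for step in proof:
        if step in issues:
            repaired.append({"text": step["text"] + " [repaired]", "flawed": False})
        else:
            repaired.append(step)
    return repaired

def self_verified_generate(proof, max_rounds=4):
    """Iterate verify -> revise until the verifier finds no issues or the
    compute budget (max_rounds) is exhausted, mirroring the idea of
    resolving issues before finalizing a proof."""
    for _ in range(max_rounds):
        issues = verify(proof)
        if not issues:
            break
        proof = revise(proof, issues)
    return proof, len(verify(proof))

draft = [
    {"text": "step 1", "flawed": False},
    {"text": "step 2", "flawed": True},
]
final_proof, remaining_issues = self_verified_generate(draft)
```

In this toy run the single flawed step is repaired in one round, leaving no remaining issues; in the paper, the same loop structure is driven by verifier rewards during RL training, with verification compute scaled up to label hard-to-verify proofs.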
Related papers
- Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z)
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection [42.21636315733425]
Large language models have achieved remarkable success on final-answer mathematical problems. However, the reasoning underlying these solutions is often flawed. We evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance.
arXiv Detail & Related papers (2025-11-17T06:25:35Z)
- Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math [80.46254366870447]
We introduce Hard2Verify, a step-level verification benchmark produced with over 500 hours of human labor. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models.
arXiv Detail & Related papers (2025-10-15T16:50:54Z)
- Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving [36.20164235042574]
In this work, we propose **Seed-Prover**, a lemma-style whole-proof reasoning model. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin.
arXiv Detail & Related papers (2025-07-31T17:00:30Z)
- Solving Inequality Proofs with Large Language Models [42.667163027148916]
Inequality proving is crucial across diverse scientific and mathematical fields. This makes it a demanding frontier for large language models (LLMs). We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
- Let's Verify Math Questions Step by Step [29.69769942300042]
MathQ-Verify is a novel pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions.
arXiv Detail & Related papers (2025-05-20T04:07:29Z)
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning [95.31714779585272]
DeepMath-103K is a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9). It includes rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. DeepMath-103K fosters the development of generalizable, advanced reasoning.
arXiv Detail & Related papers (2025-04-15T17:59:51Z)
- Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically [29.908878832382523]
This paper focuses on improving LLMs' ability to write proofs in formal languages that permit automated proof verification/evaluation.
We work in a more natural setup where the lemmas that are directly relevant to the theorem are not given to the theorem prover at test time.
We design an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas.
arXiv Detail & Related papers (2024-11-04T05:57:40Z)
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- TheoremQA: A Theorem-driven Question Answering dataset [100.39878559382694]
TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems.
GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting.
arXiv Detail & Related papers (2023-05-21T17:51:35Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases, showing promising results for both QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.