DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
- URL: http://arxiv.org/abs/2408.08152v1
- Date: Thu, 15 Aug 2024 13:40:03 GMT
- Title: DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
- Authors: Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, Chong Ruan
- Abstract summary: We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4.
The model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1.
We propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths.
- Score: 16.477438279316576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark ($63.5\%$) and the undergraduate level ProofNet benchmark ($25.3\%$).
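For intuition, here is a minimal Python sketch of the intrinsic-reward idea behind RMaxTS as described in the abstract: an RMax-style bonus of 1 is granted whenever a rollout adds a previously unseen tactic state to the search tree, and standard UCT selection then operates on the combined reward signal. The node structure and helper names are illustrative, not the paper's implementation.

```python
import math

class Node:
    def __init__(self, state: str):
        self.state = state
        self.visits = 0
        self.value_sum = 0.0   # accumulated extrinsic + intrinsic return
        self.children = []

def uct_select(parent: Node, c: float = 1.4) -> Node:
    """Pick the child maximizing the UCT score over the combined signal."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")   # always try unvisited children first
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return exploit + explore
    return max(parent.children, key=score)

def intrinsic_reward(seen_states: set, state: str) -> float:
    """RMax-style novelty bonus: reward discovering a new tactic state."""
    if state in seen_states:
        return 0.0
    seen_states.add(state)
    return 1.0
```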
Related papers
- Phi-4-reasoning Technical Report [42.508165017775]
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks.
We develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning.
Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model.
arXiv Detail & Related papers (2025-04-30T05:05:09Z)
- Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [38.4246156415702]
Kimina-Prover is a large language model that pioneers a novel reasoning-driven exploration paradigm for formal theorem proving.
Trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, Kimina-Prover demonstrates strong performance in Lean 4 proof generation.
Kimina-Prover sets a new state-of-the-art on the miniF2F benchmark, reaching 80.7% with pass@8192.
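As a point of reference, figures like pass@8192 are commonly computed with the unbiased pass@k estimator from the HumanEval/Codex evaluation; the sketch below assumes that protocol, which the abstract itself does not spell out.

```python
import numpy as np

# n = samples drawn per problem, c = samples yielding a verified proof,
# k = budget being reported.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # fewer than k failing samples: every k-subset contains a pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```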
arXiv Detail & Related papers (2025-04-15T16:23:44Z)
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model [47.108822717757945]
We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training.
In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
arXiv Detail & Related papers (2025-03-31T16:36:05Z)
- Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art performance (as of April 5, 2025) in automated formal proof generation for mathematical problems.
First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4.
Next, we develop a large dataset of formal proofs by training a series of provers.
Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1.
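To make the pipeline concrete, here is a hypothetical example of the autoformalization step: a Numina-style natural-language problem ("show that the sum of two even integers is even") rendered as an equivalent Lean 4 statement for a downstream prover to close. The statement and proof are our illustration, not drawn from Goedel-Pset-v1.

```lean
import Mathlib

-- Hypothetical autoformalization target: "Show that the sum of two even
-- integers is even."
theorem sum_of_evens_is_even (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha   -- Even a means a = x + x
  obtain ⟨y, hy⟩ := hb   -- Even b means b = y + y
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```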
arXiv Detail & Related papers (2025-02-11T15:27:35Z)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [147.16121855209246]
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero is trained via large-scale reinforcement learning.
DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
arXiv Detail & Related papers (2025-01-22T15:19:35Z)
- DeepSeek-V3 Technical Report [147.16121855209246]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
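A toy sketch of the sparse top-k expert routing behind the "671B total / 37B activated" figure follows. DeepSeek-V3's DeepSeekMoE additionally uses shared experts, fine-grained expert segmentation, and auxiliary-loss-free load balancing, all omitted here; dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, indices = gates.topk(self.k, dim=-1)  # each token picks k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():  # only the selected experts run for these tokens
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```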
arXiv Detail & Related papers (2024-12-27T04:03:16Z)
- DepthSplat: Connecting Gaussian Splatting and Depth [90.06180236292866]
We present DepthSplat to connect Gaussian splatting and depth estimation.
We show that Gaussian splatting can serve as an unsupervised pre-training objective for learning powerful depth models.
Our DepthSplat achieves state-of-the-art performance on ScanNet, RealEstate10K and DL3DV datasets.
arXiv Detail & Related papers (2024-10-17T17:59:58Z)
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG).
We propose a novel instruction fine-tuning framework, RankRAG, which instruction-tunes a single LLM for the dual purposes of context ranking and answer generation in RAG.
For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks.
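Schematically, the dual-purpose idea can be sketched as below; `llm` is a hypothetical text-completion callable and the prompts are illustrative, not RankRAG's actual templates.

```python
from typing import Callable, List

def rank_then_generate(llm: Callable[[str], str], question: str,
                       contexts: List[str], top_n: int = 5) -> str:
    def relevance(ctx: str) -> float:
        # Stage 1: the same LLM judges context relevance.
        prompt = (f"Question: {question}\nContext: {ctx}\n"
                  "Is this context relevant to the question? Answer True or False:")
        return 1.0 if llm(prompt).strip().startswith("True") else 0.0

    reranked = sorted(contexts, key=relevance, reverse=True)[:top_n]
    # Stage 2: generate the answer conditioned on the reranked contexts.
    ctx_block = "\n\n".join(reranked)
    return llm(f"Contexts:\n{ctx_block}\n\nQuestion: {question}\nAnswer:")
```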
arXiv Detail & Related papers (2024-07-02T17:59:17Z)
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence [43.589403386634615]
DeepSeek-Coder-V2 is an open-source code language model that achieves performance comparable to GPT-4-Turbo in code-specific tasks.
DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens.
In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro.
arXiv Detail & Related papers (2024-06-17T13:51:35Z)
- Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation [7.005068872406135]
Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems.
These approaches frequently involve complex training pipelines and a substantial computational burden.
We propose the PrevMatch framework, which effectively mitigates these limitations by maximizing the utilization of the temporal knowledge obtained during training.
arXiv Detail & Related papers (2024-05-31T03:54:59Z)
- Proving Theorems Recursively [80.42431358105482]
We propose POETRY, which proves theorems in a level-by-level manner.
Unlike previous step-by-step methods, POETRY searches for a sketch of the proof at each level.
We observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
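Illustratively, a level-by-level search first commits to a sketch whose subgoals are deferred, then recurses on each deferred subgoal; the schematic below renders the idea in Lean 4 for consistency with this page (POETRY itself targets Isabelle-style proofs), and the theorem is our own toy example.

```lean
import Mathlib

-- Level 1 commits to a sketch whose subgoals are deferred with `sorry`;
-- each deferred subgoal is then searched for recursively at the next level.
theorem sketch_example (n : ℕ) : n ≤ n * n + 1 := by
  have h1 : n ≤ n * n := by sorry         -- deferred to level 2
  have h2 : n * n ≤ n * n + 1 := by sorry -- deferred to level 2
  exact le_trans h1 h2
```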
arXiv Detail & Related papers (2024-05-23T10:35:08Z)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model [118.06260386652778]
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs.
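A toy sketch of the core Multi-head Latent Attention (MLA) idea follows: keys and values are reconstructed from one small per-token latent, so the KV cache stores only that latent. Dimensions are illustrative, and DeepSeek-V2's decoupled RoPE path and query compression are omitted.

```python
import torch

class LatentKV(torch.nn.Module):
    def __init__(self, d_model: int = 512, d_latent: int = 64,
                 n_heads: int = 8, d_head: int = 64):
        super().__init__()
        self.down = torch.nn.Linear(d_model, d_latent)           # compress (cached)
        self.up_k = torch.nn.Linear(d_latent, n_heads * d_head)  # expand to keys
        self.up_v = torch.nn.Linear(d_latent, n_heads * d_head)  # expand to values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h: torch.Tensor):  # h: (seq_len, d_model)
        c = self.down(h)                 # (seq_len, d_latent): all the cache holds
        k = self.up_k(c).view(-1, self.n_heads, self.d_head)
        v = self.up_v(c).view(-1, self.n_heads, self.d_head)
        return k, v, c
```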
arXiv Detail & Related papers (2024-05-07T15:56:43Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment [40.328294121805456]
This work builds on VPD, which paved the way for using the Stable Diffusion network for computer vision tasks.
First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances feature-learning capabilities.
Second, we propose a novel image-text alignment module for improved feature extraction of the Stable Diffusion backbone.
arXiv Detail & Related papers (2023-12-13T22:20:45Z)
- Revisiting Deformable Convolution for Depth Completion [40.45231083385708]
Depth completion aims to generate high-quality dense depth maps from sparse depth maps.
Previous work usually employs RGB images as guidance, and introduces iterative spatial propagation to refine estimated coarse depth maps.
We propose an effective architecture that leverages deformable kernel convolution as a single-pass refinement module.
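A toy sketch of a single-pass deformable refinement module in PyTorch (our illustration of the general idea; the paper's exact architecture, channel counts, and guidance fusion differ): offsets predicted from image-guided features let each kernel tap sample content-adaptive locations, replacing iterative spatial propagation.

```python
import torch
from torchvision.ops import DeformConv2d

class DeformableRefiner(torch.nn.Module):
    def __init__(self, ch: int = 32, k: int = 3):
        super().__init__()
        self.offset_pred = torch.nn.Conv2d(ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.refine = DeformConv2d(ch, 1, kernel_size=k, padding=k // 2)

    def forward(self, feats: torch.Tensor, coarse_depth: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(feats)       # (N, 2*k*k, H, W): where taps look
        residual = self.refine(feats, offsets)  # content-adaptive aggregation
        return coarse_depth + residual          # one-shot residual refinement
```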
arXiv Detail & Related papers (2023-08-03T17:59:06Z)
- Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL)- and entropy-regularized reinforcement learning (RL).
We study MDVI with linear function approximation through the sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
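For orientation, the KL- and entropy-regularized greedy step that MDVI abstracts can be written schematically (notation ours, not necessarily the paper's) as

$$\pi_{k+1} = \operatorname*{arg\,max}_{\pi}\ \langle q_k, \pi \rangle - \tau\,\mathrm{KL}(\pi \,\|\, \pi_k) + \kappa\,\mathcal{H}(\pi) \;\implies\; \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)^{\frac{\tau}{\tau + \kappa}} \exp\!\Big(\frac{q_k(s,a)}{\tau + \kappa}\Big),$$

recovering unregularized value iteration as $\tau, \kappa \to 0$.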
arXiv Detail & Related papers (2023-05-22T16:13:05Z)
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers [84.9120606803906]
Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback.
In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
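A minimal sketch of the relabeling step follows; `describe`, which maps a model output to an instruction the output does satisfy, is a hypothetical stand-in for the paper's relabeling procedure.

```python
from typing import Callable, List, Tuple

# Turn failed generations into valid supervised examples instead of
# discarding them: rewrite each instruction in hindsight to match the output.
def hindsight_relabel(
    samples: List[Tuple[str, str]],   # (original_instruction, model_output)
    describe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    return [(describe(output), output) for _, output in samples]
```

Standard supervised fine-tuning on the relabeled pairs then stands in for the RL update.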
arXiv Detail & Related papers (2023-02-10T12:16:38Z)