Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
- URL: http://arxiv.org/abs/2509.06493v2
- Date: Thu, 09 Oct 2025 17:45:50 GMT
- Title: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
- Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao,
- Abstract summary: This paper introduces textttBFS-Prover-V2, a system designed to address the dual scaling problem.<n>The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time.<n>The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time.
- Score: 16.135928990655422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
Related papers
- Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning [7.006180736433431]
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry.<n>We propose a novel textbfAnswer-First, Reason Later (AFRL) paradigm.<n>This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation.
arXiv Detail & Related papers (2026-02-10T17:28:12Z) - DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.<n>Existing methods suffer from computational inefficiency and objective mismatches between training and inference.<n>We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - Eliciting Chain-of-Thought Reasoning for Time Series Analysis using Reinforcement Learning [2.426309874608745]
Complex numerical time series analysis often demands multi-step reasoning capabilities beyond current models' reach.<n>We introduce Chain Of thought for Understanding Numerical Time Series (COUNTS), the first framework that trains large language models to perform Chain-of-Thought (CoT) reasoning across diverse time series tasks using reinforcement learning (RL) with verifiable rewards.<n>Our experiments demonstrate that this RL-driven approach with intermediate CoT reasoning significantly enhances LLM performance across various time series analysis tasks, opening new possibilities for complex temporal data reasoning.
arXiv Detail & Related papers (2025-10-01T17:02:28Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design [35.544075583073685]
We present the first systematic study of textitturn-level reward design for multi-turn RL algorithms and agent applications.<n>We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge.<n>Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards.
arXiv Detail & Related papers (2025-05-17T04:09:46Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.<n>Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.<n>Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation [58.799397354312596]
Large language models (LLMs) have demonstrated remarkable capabilities in various domains, particularly in system 1 tasks.<n>Recent research on System2-to-System1 methods surge, exploring the System 2 reasoning knowledge via inference-time computation.<n>In this paper, we focus on code generation, which is a representative System 2 task, and identify two primary challenges.
arXiv Detail & Related papers (2025-02-18T03:20:50Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.