Fugu-MT 論文翻訳(概要): Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

論文の概要: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

arxiv url: http://arxiv.org/abs/2509.06493v1
Date: Mon, 08 Sep 2025 09:54:18 GMT
ステータス: 翻訳完了
システム内更新日: 2025-09-09 14:07:04.047959
Title: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Title（参考訳）: LLMステッププローバのマルチターンオフポリシィRLとマルチエージェントツリー探索のスケールアップ
Authors: Ran Xin, Zeyu Zheng, Yanchen Nie, Kun Yuan, Xia Xiao,
Abstract要約: 本稿では,2つのスケーリング問題に対処するシステムである textttBFS-Prover-V2 を紹介する。 1つ目は、トレーニング時にLLMのステッププロデューサの性能を継続的に改善する、新しいマルチターンオフポリチフレームワークである。第二のイノベーションは、推論時に推論能力を拡大するプランナーによるマルチエージェント検索アーキテクチャである。
参考スコア（独自算出の注目度）: 16.135928990655422
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces \texttt{BFS-Prover-V2}, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into a sequence of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. \texttt{BFS-Prover-V2} achieves 95.08\% and 41.4\% on the MiniF2F and ProofNet test sets respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon multi-turn reasoning and complex search.
Abstract（参考訳）: LLM(Large Language Models)を自動定理証明に統合することは、非常に有望であるが、トレーニング時強化学習(RL)と推論時計算の両方をスケールアップする際の課題により、根本的な制約が課されている。本稿では、この二重スケーリング問題に対処するために設計されたシステムである「texttt{BFS-Prover-V2}」を紹介する。主なイノベーションは2つある。 1つ目は、トレーニング時にLLMのステッププロデューサの性能を継続的に改善する、新しいマルチターンオフポリチフレームワークである。 AlphaZeroの原理にインスパイアされたこのフレームワークは、適応的な戦術レベルのデータフィルタリングと周期的リトレーニングを備えた多段階のエキスパートイテレーションパイプラインを使用して、LLMベースのエージェントで通常、長期RLを縮小するパフォーマンスプレートを克服する。第二のイノベーションは、推論時に推論能力を拡大するプランナーによるマルチエージェント検索アーキテクチャである。このアーキテクチャは、複素定理をより単純な部分ゴールの列に反復的に分解するために、高レベルプランナーとして一般的な推論モデルを用いる。この階層的なアプローチは検索スペースを大幅に削減し、並列証明エージェントのチームが共有証明キャッシュを利用することで効率よく協調することを可能にする。このスケーリングに対する2つのアプローチは、確立された公式な数学ベンチマークにおける最先端の結果をもたらすことを実証する。 \texttt{BFS-Prover-V2} はそれぞれ MiniF2F と ProofNet のテストセットで 95.08\% と 41.4\% を達成する。形式数学の分野において実証されているが、この研究で提示されたRLおよび推論技術はより広い関心を持ち、長い水平多重ターン推論と複素探索を必要とする他の領域に適用することができる。

論文の概要: Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers

関連論文リスト