Fugu-MT Paper Translation (Abstract): DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Paper summary: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

arxiv url: http://arxiv.org/abs/2509.25454v2
Date: Wed, 01 Oct 2025 05:09:42 GMT
Status: Translation complete
Last updated in system: 2025-10-02 14:33:21.829279
Title: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Title (reference translation): DeepSearch: Overcoming the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Authors: Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
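Abstract summary: We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop. Contributions include (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency.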
Reference score (independently computed attention level): 53.27052683356095
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
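Abstract (reference translation): Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, recent studies have documented training plateaus that emerge after thousands of optimization steps, with performance gains shrinking noticeably despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practice, in which models rely on limited rollouts that often miss critical reasoning paths and fail to cover the solution space systematically. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training. Contributions include (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch reaches 62.95% average accuracy and establishes a new state of the art for 1.5B reasoning models while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.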
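
Illustrative sketch (not from the paper): the abstract describes a global frontier selection strategy that prioritizes promising nodes across the whole search tree. One way to picture this is a single priority queue holding every unexpanded leaf, so the next node to expand can come from any branch rather than from a root-to-leaf descent. The names `Node`, `GlobalFrontier`, `expand`, and `path_to_root`, as well as the scalar `score` used as the priority, are assumptions made for this example; the abstract does not specify the scoring rule or the data structures DeepSearch uses.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One partial reasoning trace in the search tree (hypothetical structure)."""
    text: str
    score: float  # assumed priority; the abstract does not define how nodes are scored
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class GlobalFrontier:
    """Holds every unexpanded leaf of the tree in one max-priority queue."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Node]] = []
        self._counter = 0  # tie-breaker so heapq never has to compare Node objects

    def push(self, node: Node) -> None:
        # heapq is a min-heap, so the score is negated to pop the best node first.
        heapq.heappush(self._heap, (-node.score, self._counter, node))
        self._counter += 1

    def pop_best(self) -> Node:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def expand(node: Node, candidates: List[Tuple[str, float]], frontier: GlobalFrontier) -> None:
    """Attach scored continuations to `node` and add each one to the frontier."""
    for text, score in candidates:
        child = Node(text=text, score=score, parent=node)
        node.children.append(child)
        frontier.push(child)


def path_to_root(node: Node) -> List[str]:
    """Return the step texts from the root down to `node`."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current.text)
        current = current.parent
    return path[::-1]


# Toy run with made-up scores.
root = Node("problem", 0.0)
frontier = GlobalFrontier()
expand(root, [("step A", 0.2), ("step B", 0.7)], frontier)

first = frontier.pop_best()  # "step B": highest score on the frontier
expand(first, [("step B1", 0.1), ("step B2", 0.15)], frontier)

# Both children of "step B" score lower than "step A", so selection
# jumps back to the other branch instead of descending further.
second = frontier.pop_best()
print(first.text, second.text, path_to_root(second))
```

In this toy run the second expansion goes to "step A" even though "step B" was expanded most recently, which is the behaviour that distinguishes a global frontier from a purely local, per-level choice. Entropy-based guidance and the replay buffer mentioned in the abstract are not modelled here.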


Related papers