Fugu-MT Paper Translation (Abstract): DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

Paper overview: DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

arxiv url: http://arxiv.org/abs/2510.12979v1
Date: Tue, 14 Oct 2025 20:47:05 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.417016
Title: DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
Authors: Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin
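Abstract summary: We propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high-entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts.
Reference score (attention score computed independently by this site): 74.34061104176554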
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
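The abstract names two advantage-shaping steps but gives no formulas. The following is a minimal sketch of how such a scheme could look, not the authors' implementation. The multiplicative form `A * (1 + alpha * H_t)`, the parameter names `alpha` and `plan_boost`, their default values, and the boolean flag marking a rollout as planning-intensive are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantages(
    sample_advantage: float,
    token_entropies: Sequence[float],
    is_planning_intensive: bool,
    alpha: float = 0.1,       # hypothetical weight of the entropy term
    plan_boost: float = 1.5,  # hypothetical upweighting factor
) -> List[float]:
    """Return one shaped advantage per token of a single rollout.

    Assumed scheme (not taken from the paper):
      1. Sample level: scale the rollout's advantage by `plan_boost`
         when the rollout is flagged as planning-intensive.
      2. Token level: scale each token's advantage by (1 + alpha * H_t),
         so higher-entropy tokens receive larger-magnitude updates.
    """
    base = sample_advantage * (plan_boost if is_planning_intensive else 1.0)
    return [base * (1.0 + alpha * h) for h in token_entropies]


if __name__ == "__main__":
    # Two toy next-token distributions: one confident, one uncertain.
    entropies = [
        token_entropy([0.97, 0.01, 0.01, 0.01]),
        token_entropy([0.25, 0.25, 0.25, 0.25]),
    ]
    print(shape_advantages(1.0, entropies, is_planning_intensive=True))
```

In this sketch the uncertain token ends up with the larger shaped advantage, which matches the abstract's stated goal of allocating larger updates to high-entropy tokens. A real training loop would compute entropies from the policy's logits and would likely bound the entropy term.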


Related papers