Fugu-MT Paper Translation (Summary): DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Paper summary: DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

arxiv url: http://arxiv.org/abs/2604.24320v1
Date: Mon, 27 Apr 2026 11:09:49 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.916002
Title: DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
Authors: Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao
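Abstract summary: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved strong performance on many complex tasks. This paper proposes a new paradigm in which an agent can interact with multiple environments simultaneously, and introduces DPEPO, a reinforcement learning algorithm that encourages diverse parallel exploration.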
Reference score (independently computed attention score): 38.16347415282427
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: an initial supervised fine-tuning (SFT) stage imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
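The hierarchical reward scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the reward formulas, so everything below is an assumption made for illustration: the `1/count` diversity rule, the function names `diversity_reward`, `step_rewards`, and `total_rewards`, and the weights `w_action`, `w_state`, and `step_weight` are all hypothetical and are not taken from the paper or its repository.

```python
from __future__ import annotations

from collections import Counter
from typing import Hashable, Sequence


def diversity_reward(items: Sequence[Hashable]) -> list[float]:
    """Score each parallel branch by how unique its item is.

    Hypothetical rule (not from the paper): an item shared by k
    branches earns 1/k, so a unique item earns 1.0 and redundant
    items are penalized.
    """
    counts = Counter(items)
    return [1.0 / counts[item] for item in items]


def step_rewards(
    actions: Sequence[Hashable],
    next_states: Sequence[Hashable],
    w_action: float = 0.5,  # assumed weight, for illustration only
    w_state: float = 0.5,   # assumed weight, for illustration only
) -> list[float]:
    """Combine action diversity and state-transition diversity per branch."""
    action_scores = diversity_reward(actions)
    state_scores = diversity_reward(next_states)
    return [
        w_action * a + w_state * s
        for a, s in zip(action_scores, state_scores)
    ]


def total_rewards(
    actions: Sequence[Hashable],
    next_states: Sequence[Hashable],
    success: bool,
    step_weight: float = 0.1,  # assumed weight, for illustration only
) -> list[float]:
    """Add a shared trajectory-level success reward to each step-level term."""
    success_reward = 1.0 if success else 0.0
    return [
        success_reward + step_weight * r
        for r in step_rewards(actions, next_states)
    ]
```

For example, with three parallel environments where two branches choose the same action and land in the same state, `step_rewards(["open drawer", "open drawer", "go to desk"], ["s1", "s1", "s2"])` gives the two redundant branches a lower score than the third branch, which is the qualitative behavior the abstract attributes to the Diverse Action Reward and Diverse State Transition Reward.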


Related papers