Fugu-MT Paper Translation (Summary): Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Paper summary: Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

arxiv url: http://arxiv.org/abs/2511.13288v2
Date: Tue, 18 Nov 2025 03:13:18 GMT
Status: Translation complete
Last updated in system: 2025-11-19 13:59:16.886875
Title: Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
Title (reference translation): Multi-Agent Deep Research: Training Multi-Agent Systems Using M-GRPO
Authors: Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu
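Abstract summary: Current training methods train a single unified large language model for all agents in the system. Because the underlying distributions differ across agents, this may limit performance. The authors propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization for vertical multi-agent systems.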
Reference score (independently computed attention score): 24.532870400949424
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit performance because the underlying distributions differ across agents. Therefore, training multi-agent systems with distinct LLMs should be the next step. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents improves performance on tool-augmented reasoning tasks.
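Abstract (reference translation): Multi-agent systems work well on general reasoning tasks. However, a lack of training in specialized domains holds back their accuracy. Current training methods train one unified large language model (LLM) for every agent in the system. Because the distributions underlying different agents differ, this can limit performance. Training multi-agent systems with distinct LLMs is therefore the next step. This approach, however, brings optimization challenges. For example, agents operate at different frequencies, rollouts contain varying numbers of sub-agent invocations, and agents are often deployed across separate servers, which disrupts end-to-end gradient flow. To address these challenges, the authors propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both the main agent and the sub-agents, preserving hierarchical credit assignment. It also introduces a trajectory-alignment scheme that produces fixed-size batches regardless of the variable number of sub-agent invocations. The authors deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics through a shared store. This allows scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, with improved stability and sample efficiency. These results suggest that aligning heterogeneous trajectories and decoupling optimization across specialized agents improves performance on tool-augmented reasoning tasks.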
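The abstract names two mechanisms without giving formulas: group-relative advantages and a trajectory-alignment step that yields fixed-size batches. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. `group_relative_advantages` uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). `align_to_fixed_size` is a hypothetical rule (random subsampling when there are too many sub-agent trajectories, random duplication when there are too few). The abstract does not say how M-GRPO actually aligns trajectories. All function names, parameters, and sample values are invented for illustration.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its own group.

    Assumption: the same normalization is applied separately to the
    main-agent group and to each sub-agent group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def align_to_fixed_size(trajectories, target, rng):
    """Hypothetical alignment rule, NOT taken from the paper.

    Returns exactly `target` trajectories: subsample without replacement
    if there are too many, pad with random duplicates if there are too few.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory to align")
    if len(trajectories) >= target:
        return rng.sample(trajectories, target)
    extra = [rng.choice(trajectories) for _ in range(target - len(trajectories))]
    return list(trajectories) + extra


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up rewards for four rollouts of the same query.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Three sub-agent trajectories padded to a fixed batch of four.
    print(align_to_fixed_size(["s1", "s2", "s3"], 4, rng))
```

Fixing the number of sub-agent trajectories per rollout keeps batch shapes constant, which is the property the abstract attributes to the alignment scheme. How duplicated or dropped trajectories should be weighted is left open here.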

Related papers