Fugu-MT Paper Translation (Summary): UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Paper summary: UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

arxiv url: http://arxiv.org/abs/2605.26646v1
Date: Tue, 26 May 2026 07:30:03 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.729123
Title: UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
Title (reference translation): UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
Authors: Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao
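Abstract summary: The paper proposes UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow, rather than a single response or policy trajectory, as the optimization unit. The authors show that UnityMAS-O can serve as a reusable substrate for converting diverse multi-agent workflows into trainable multi-agent RL systems.
Reference score (attention score computed independently by the site): 46.48622741253505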
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
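Abstract (reference translation): LLM-based multi-agent systems decompose complex tasks into interacting roles, but most are manually orchestrated by prompts, tools, and control rules. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. The paper proposes UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow, rather than a single response or policy trajectory, as the optimization unit. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at the role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. UnityMAS-O is instantiated on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.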
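
The decoupling of logical agents from physical models that the abstract describes can be illustrated with a small sketch. This is not the UnityMAS-O API: the names `Step`, `combine_rewards`, `group_by_model`, the role names, and the model identifiers are all hypothetical, and summing the role-, turn-, and trajectory-level rewards is an assumption made only for illustration (the abstract does not say how the levels are combined).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a graph trajectory: which role acted on which turn."""
    role: str
    turn: int
    output: str
    parents: list[int] = field(default_factory=list)  # indices of upstream steps


def combine_rewards(steps, role_reward, turn_reward, trajectory_reward):
    """Assumed rule: add role-, turn-, and trajectory-level rewards per step."""
    return [
        role_reward.get(s.role, 0.0) + turn_reward.get(s.turn, 0.0) + trajectory_reward
        for s in steps
    ]


def group_by_model(steps, rewards, agent_to_model):
    """Route each (step, reward) pair to the model backing the step's role."""
    batches = defaultdict(list)
    for step, reward in zip(steps, rewards):
        batches[agent_to_model[step.role]].append((step, reward))
    return dict(batches)


# Partial sharing: planner and critic share one model, coder has its own.
agent_to_model = {"planner": "model_a", "critic": "model_a", "coder": "model_b"}

# A reflective code-generation workflow recorded as a small graph.
steps = [
    Step("planner", 0, "plan"),
    Step("coder", 1, "code v1", parents=[0]),
    Step("critic", 2, "feedback", parents=[1]),
    Step("coder", 3, "code v2", parents=[1, 2]),
]

rewards = combine_rewards(
    steps,
    role_reward={"critic": 0.5},   # role-level bonus
    turn_reward={3: 0.25},         # turn-level bonus for the final revision
    trajectory_reward=1.0,         # shared outcome reward
)
batches = group_by_model(steps, rewards, agent_to_model)
```

Changing only `agent_to_model` switches between full sharing (every role mapped to one model), full separation (one model per role), and partial sharing, while the workflow and reward definitions stay the same.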

Related papers