Fugu-MT Paper Translation (Summary): Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Paper summary: Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

arxiv url: http://arxiv.org/abs/2603.07972v1
Date: Mon, 09 Mar 2026 05:18:07 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.50308
Title: Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
Title (reference translation): Adaptive Collaboration with Humans: Metacognitive Policy Optimization of Multi-Agent LLMs via Continual Learning
Authors: Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu
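Abstract summary: This paper proposes the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework. HILA trains agents to learn a metacognitive policy that decides when to solve a problem autonomously and when to defer to a human expert. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS.
Reference score (independently computed attention score): 12.114998959919978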
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
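Abstract (reference translation): Scaling individual Large Language Models (LLMs) has delivered remarkable progress, but the next frontier is scaling collaboration through multi-agent systems (MAS). Purely autonomous MAS, however, remain "closed-world" systems constrained by the static knowledge horizon of their pre-trained models. This limitation makes them brittle on tasks that require knowledge beyond the training data and often leads to collective failure on novel challenges. To address this, the paper proposes the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework. HILA trains agents to learn a metacognitive policy that decides when to solve a problem autonomously and when to defer to a human expert. To operationalize this policy, the authors introduce Dual-Loop Policy Optimization, which separates immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, turning expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA with Dual-Loop Policy Optimization consistently outperforms advanced MAS, establishing a principled foundation for collaborative, continually improving agentic systems.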
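Illustrative sketch (not from the paper): the abstract says the inner loop applies GRPO with a cost-aware reward to deferral decisions but gives no formulas. The Python below is a minimal sketch of one way such a signal could be computed, under stated assumptions. The action labels, the reward shape (1 for a correct answer minus a fixed `defer_cost` when the expert is consulted), the value 0.3, and the sample outcomes are all hypothetical. The only part taken from GRPO as commonly described is standardizing each reward against its own sampling group; using the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev

# Hypothetical action labels: the abstract only says the policy chooses
# between solving autonomously and deferring to a human expert.
SOLVE, DEFER = "solve", "defer"


def cost_aware_reward(action, correct, defer_cost=0.3):
    """Assumed reward shape (not the paper's): 1 for a correct final answer,
    0 otherwise, minus a fixed penalty whenever the expert is consulted."""
    task_reward = 1.0 if correct else 0.0
    return task_reward - (defer_cost if action == DEFER else 0.0)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group,
    so no separate learned value function is needed as a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled decisions for the same problem (made-up outcomes):
# a failed solo attempt, a successful solo attempt, and two deferrals.
group = [(SOLVE, False), (SOLVE, True), (DEFER, True), (DEFER, True)]
rewards = [cost_aware_reward(action, correct) for action, correct in group]
advantages = group_relative_advantages(rewards)

for (action, correct), r, a in zip(group, rewards, advantages):
    print(f"{action:5s} correct={correct!s:5s} reward={r:.2f} advantage={a:+.3f}")
```

Under these assumptions a correct autonomous answer gets the highest advantage, a correct deferral a smaller positive one, and a failed autonomous attempt a negative one, which is the kind of trade-off a cost-aware deferral policy would need to learn. The outer continual-learning loop is not sketched here because the abstract gives no detail on how expert feedback is turned into supervised signals.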

Related papers