Fugu-MT Paper Translation (Abstract): XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Paper overview: XSkill: Continual Learning from Experience and Skills in Multimodal Agents

arxiv url: http://arxiv.org/abs/2603.12056v1
Date: Thu, 12 Mar 2026 15:25:57 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.180883
Title: XSkill: Continual Learning from Experience and Skills in Multimodal Agents
Title (reference translation): XSkill: Continual Learning from Experience and Skills in Multimodal Agents
Authors: Guanyu Jiang, Zhaochen Su, Xiaoye Qu, Yi R. Fung
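Abstract summary: XSkill is a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. XSkill consistently and substantially outperforms both tool-only and learning-based baselines.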
Reference score (independently computed attention score): 21.536999624068716
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
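Abstract (reference translation): Multimodal agents can now tackle complex reasoning tasks with diverse tools, but they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. The key challenge is to let such agents improve continually, without parameter updates, by learning from past trajectories. Two complementary forms of reusable knowledge are identified as essential to this goal: experiences, which give concise action-level guidance for tool selection and decision making, and skills, which give structured task-level guidance for planning and tool use. To this end, the paper proposes XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts through visually grounded summarization and cross-rollout critique. During inference, it retrieves this knowledge, adapts it to the current visual context, and feeds usage history back into accumulation, forming a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis shows that the two knowledge streams play complementary roles in shaping agents' reasoning behavior and exhibit superior zero-shot generalization.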
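
The abstract describes the accumulate-then-retrieve loop only at a high level and gives no implementation detail. The following is a toy sketch of that loop, not the authors' method: the names `Entry` and `KnowledgeBank`, the use of string tags as a stand-in for visual grounding, and the tag-overlap ranking are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One stored piece of knowledge (hypothetical structure)."""
    kind: str          # "experience" (action-level) or "skill" (task-level)
    text: str          # the guidance itself
    visual_tags: set   # assumed stand-in for visual grounding
    uses: int = 0      # usage history, available to later accumulation


@dataclass
class KnowledgeBank:
    """Toy two-stream store: experiences and skills share one list."""
    entries: list = field(default_factory=list)

    def accumulate(self, kind, text, visual_tags):
        """Add an entry; merge tags if the same kind and text already exist."""
        tags = set(visual_tags)
        for e in self.entries:
            if e.kind == kind and e.text == text:
                e.visual_tags |= tags  # consolidate instead of duplicating
                return e
        entry = Entry(kind, text, tags)
        self.entries.append(entry)
        return entry

    def retrieve(self, kind, observed_tags, k=1):
        """Return up to k entries of one kind, ranked by tag overlap."""
        observed = set(observed_tags)
        scored = [(len(e.visual_tags & observed), e)
                  for e in self.entries if e.kind == kind]
        scored = [(s, e) for s, e in scored if s > 0]  # drop non-matches
        scored.sort(key=lambda pair: pair[0], reverse=True)
        hits = [e for _, e in scored[:k]]
        for e in hits:
            e.uses += 1  # record usage so it can inform later accumulation
        return hits
```

In this sketch, retrieving with the tags observed for a new task returns the best-matching stored guidance and increments its `uses` counter, which is the simplest way to model usage history flowing back into the store.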


Related papers