Fugu-MT Paper Translation (Abstract): PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Paper summary: PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

arxiv url: http://arxiv.org/abs/2605.24785v1
Date: Sun, 24 May 2026 00:07:25 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.437464
Title: PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Title (reference translation): PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Authors: Yubo Li, Yidi Miao, Haotian Shen, Yuxin Liu
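Abstract summary: We introduce PANDO, a single-rollout online skill-distillation framework. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate. A 300-task ablation shows that rules and routines provide most of the success gains.
Reference score (independently calculated attention score): 9.309788574955034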
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
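Abstract (reference translation): Recent advances in multimodal web agents often rely on increased inference-time computation, such as rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient, rather than more costly, as it accumulates experience? We first analyze trajectories from VisualWebArena and identify three sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and the WALT reproduction (45.2%). A 300-task ablation shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting turn the larger skill library into lower marginal token cost. Finally, three trajectory-level efficiency metrics (Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization) are introduced to make efficiency visible beyond terminal success.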
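
The abstract names three trajectory-level efficiency metrics but does not define them. The sketch below shows one plausible reading of each, written only to make the ideas concrete. Every formula, function name, and parameter here is an assumption for illustration and may differ from the definitions used in the paper.

```python
from typing import Sequence


def action_repetition_rate(actions: Sequence[str]) -> float:
    """Assumed definition: fraction of steps that repeat the previous action."""
    if len(actions) < 2:
        return 0.0
    # Count adjacent pairs where the agent issued the same action twice in a row.
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / (len(actions) - 1)


def step_overhead_ratio(steps_taken: int, reference_steps: int) -> float:
    """Assumed definition: extra steps relative to a reference trajectory length."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return (steps_taken - reference_steps) / reference_steps


def prompt_cache_utilization(cached_tokens: int, prompt_tokens: int) -> float:
    """Assumed definition: share of prompt tokens served from the prompt cache."""
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Hypothetical trajectory: action strings are made up for this example.
trajectory = ["click:12", "click:12", "type:5", "click:12", "click:12"]
arr = action_repetition_rate(trajectory)
sor = step_overhead_ratio(steps_taken=12, reference_steps=8)
pcu = prompt_cache_utilization(cached_tokens=3000, prompt_tokens=4000)
```

Under these assumed definitions, a lower Action Repetition Rate and Step Overhead Ratio and a higher Prompt Cache Utilization would indicate a more efficient trajectory, independent of whether the task ultimately succeeded.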


Related papers