Fugu-MT Paper Translation (Abstract): Adaptive Vision-Language Model Routing for Computer Use Agents

Paper summary: Adaptive Vision-Language Model Routing for Computer Use Agents

arxiv url: http://arxiv.org/abs/2603.12823v1
Date: Fri, 13 Mar 2026 09:21:25 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.021593
Title: Adaptive Vision-Language Model Routing for Computer Use Agents
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
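Abstract summary: Computer Use Agents translate instructions into actions such as clicks, keystrokes, and scrolls. Current CUA systems typically route every action to a single fixed model regardless of difficulty. This paper proposes Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs.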
Reference score (independently computed attention level): 9.457255218406333
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code are available at https://github.com/vllm-project/semantic-router.
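The threshold-based policy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Model` class, the `route` function, the model names, and the numeric values are all hypothetical. The sketch also assumes that the most expensive model in the pool is the strongest one, that predicted accuracies are supplied by the caller (the abstract derives them from multimodal embeddings and a small-VLM confidence probe, which is not modelled here), and that the router falls back to the strongest model when no model meets the threshold.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Model:
    name: str    # hypothetical model identifier
    cost: float  # assumed relative cost per tool call


def route(models: Sequence[Model],
          predicted_accuracy: Mapping[str, float],
          threshold: float,
          high_risk: bool = False) -> Model:
    """Pick the cheapest model whose predicted accuracy meets the threshold."""
    if not models:
        raise ValueError("model pool is empty")
    by_cost = sorted(models, key=lambda m: m.cost)
    # Assumption: the most expensive model is the strongest one.
    strongest = by_cost[-1]
    # High-risk actions skip the cost search and escalate directly.
    if high_risk:
        return strongest
    for model in by_cost:
        # Models with no accuracy estimate are treated as failing the threshold.
        if predicted_accuracy.get(model.name, 0.0) >= threshold:
            return model
    # Assumption: fall back to the strongest model if none qualifies.
    return strongest


# Illustrative values only; these are not results from the paper.
pool = [Model("small-vlm", 1.0), Model("large-vlm", 10.0)]
easy_action = {"small-vlm": 0.93, "large-vlm": 0.97}
hard_action = {"small-vlm": 0.41, "large-vlm": 0.90}
```

With a threshold of 0.9, `route(pool, easy_action, 0.9)` selects the small model, `route(pool, hard_action, 0.9)` escalates to the large one, and passing `high_risk=True` escalates regardless of the estimates.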


Related papers