Fugu-MT Paper Translation (Abstract): Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference

Paper overview: Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference

arxiv url: http://arxiv.org/abs/2601.22701v1
Date: Fri, 30 Jan 2026 08:22:18 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.31952
Title: Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
Title (reference translation): Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
Authors: Emilien Biré, María Santos, Kai Yuan
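Abstract summary: Vision-Language Models (VLMs) have become powerful backbones for agents that operate autonomously in digital environments. These models cannot adapt to fast-changing environments such as the web. This paper proposes a new paradigm for enhancing agentic VLM policies at inference time without policy retraining.
Reference score (independently computed attention level): 4.943575742796223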
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have become powerful backbones for agents that operate autonomously in digital environments like the web and operating systems. However, these models adapt poorly to fast-changing environments like the web. Fine-tuning can alleviate this, but it requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference without policy retraining. Fundamentally, our approach decouples the VLM's role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, rather than offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
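Abstract (reference translation): Vision-Language Models (VLMs) have become powerful backbones for agents that operate autonomously in digital environments such as the web and operating systems. However, these models struggle to adapt to fast-changing environments such as the web, a problem that fine-tuning can mitigate at the cost of expensive model training and data collection. This work proposes a new paradigm for enhancing agentic VLM policies at inference time without policy retraining. The approach separates the VLM's role as a high-capacity action proposer from the final action selection mechanism. The VLM policy is kept frozen and used to generate a set of candidate actions for a given state. A lightweight Q-function trained offline then reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is applying the Q-function directly during inference for immediate policy improvement, rather than using it offline to relabel data for policy retraining. On the academic WebVoyager benchmark, the method substantially raises agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.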
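
The selection step described in the abstract (a frozen policy proposes candidates, a Q-function reranks them, the agent executes the top-valued one) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `best_of_q`, `propose`, `q_value`, and the toy stand-ins are hypothetical names introduced here, actions are assumed to be plain strings, and the candidate count of 4 is an arbitrary choice for illustration.

```python
from typing import Callable, List

# Assumption: states and actions are represented as plain strings.
ProposeFn = Callable[[str, int], List[str]]   # frozen policy: (state, n) -> candidates
QFn = Callable[[str, str], float]             # learned critic: (state, action) -> value


def best_of_q(state: str, propose: ProposeFn, q_value: QFn, n_candidates: int = 4) -> str:
    """Return the proposed action with the highest estimated Q-value."""
    candidates = propose(state, n_candidates)
    if not candidates:
        raise ValueError("policy proposed no candidate actions")
    # Drop repeated proposals while preserving order, so each action is scored once.
    unique = list(dict.fromkeys(candidates))
    # max() keeps the first maximal element, so ties go to the earliest proposal.
    return max(unique, key=lambda action: q_value(state, action))


# Toy stand-ins for the frozen VLM policy and the offline-trained Q-function.
def toy_propose(state: str, n: int) -> List[str]:
    return ["click(search)", "scroll(down)", "type(query)", "click(search)"][:n]


TOY_Q = {"click(search)": 0.2, "scroll(down)": 0.1, "type(query)": 0.7}


def toy_q(state: str, action: str) -> float:
    return TOY_Q.get(action, 0.0)


chosen = best_of_q("page: search form", toy_propose, toy_q, n_candidates=4)
print(chosen)
```

In this sketch the policy is only ever called to sample candidates, so improving the agent means swapping `q_value` and leaving the proposer untouched, which mirrors the decoupling the abstract describes.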

Related papers