Fugu-MT Paper Translation (Abstract): MVP: Multiple View Prediction Improves GUI Grounding

Paper summary: MVP: Multiple View Prediction Improves GUI Grounding

arxiv url: http://arxiv.org/abs/2512.08529v1
Date: Tue, 09 Dec 2025 12:19:00 GMT
Status: Translation complete
Last updated in system: 2025-12-10 22:28:07.948018
Title: MVP: Multiple View Prediction Improves GUI Grounding
Authors: Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
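Abstract summary: MVP (Multi-View Prediction) is a training-free framework that improves grounding performance through multi-view inference. MVP comprises two components: (1) Attention-Guided View Proposal and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster.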
Reference score (attention score computed independently by this site): 45.0902526257201
License: http://creativecommons.org/licenses/by/4.0/
Abstract: GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially on high-resolution samples with small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
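The aggregation step described in the abstract (map each view's prediction back to screenshot coordinates, then take the centroid of the densest spatial cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, so the fixed-radius neighbour count used here, the `radius` default, and the function names `to_global` and `densest_cluster_centroid` are all assumptions made for illustration.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def to_global(local_xy: Point, crop_origin: Point) -> Point:
    """Map a point predicted inside a cropped view back to screenshot coordinates.

    crop_origin is the (left, top) corner of the crop in the full screenshot.
    """
    return (local_xy[0] + crop_origin[0], local_xy[1] + crop_origin[1])


def densest_cluster_centroid(points: List[Point], radius: float = 20.0) -> Point:
    """Return the centroid of the largest group of mutually nearby predictions.

    Assumption: "densest cluster" is approximated by the point with the most
    neighbours within `radius` pixels (hypothetical parameter); outliers that
    fall outside that neighbourhood are ignored.
    """
    if not points:
        raise ValueError("need at least one prediction")

    best: List[Point] = []
    for cx, cy in points:
        # Collect every prediction within `radius` of the candidate centre.
        neighbours = [(x, y) for x, y in points if hypot(x - cx, y - cy) <= radius]
        if len(neighbours) > len(best):
            best = neighbours

    n = len(best)
    return (sum(x for x, _ in best) / n, sum(y for _, y in best) / n)
```

For example, three views that agree on a location near (100, 100) outvote a single stray prediction at (500, 400), so the returned point stays on the agreed target. A density-based clustering routine from an existing library could be substituted for the neighbour count without changing the overall idea.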

Related papers