Fugu-MT Paper Translation (Abstract): V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task

Paper abstract: V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task

arxiv url: http://arxiv.org/abs/2508.13634v1
Date: Tue, 19 Aug 2025 08:47:44 GMT
Status: translation complete
System update date: 2025-08-20 15:36:31.856043
Title: V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task
Authors: Jikai Chen, Long Chen, Dong Wang, Leilei Gan, Chenyi Zhuang, Jinjie Gu
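Abstract summary: The Valley-to-Peak (V2P) method is inspired by how humans visually process and interact with GUI elements. A model trained with V2P achieves 92.3% and 50.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively.
Reference score (independently computed interest score): 16.500878734275936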
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding-box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring the processing of background regions causes attention to drift away from the desired area, and (2) uniform labeling fails to distinguish between the center and the edges of the target UI element, leading to imprecise clicks. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions in order to highlight the intended region. To address the center-edge distinction, V2P applies a Fitts' Law-inspired approach, modeling GUI interactions as 2D Gaussian heatmaps in which the weight gradually decreases from the center towards the edges. The weight distribution follows a Gaussian function, with the variance determined by the target's size. Consequently, V2P effectively isolates the target area and teaches the model to concentrate on the most essential point of the UI element. A model trained with V2P achieves 92.3% and 50.5% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, respectively. Ablations further confirm each component's contribution, highlighting V2P's generalizability for precise GUI grounding tasks.
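The abstract's two ideas, a Gaussian weight map whose spread follows the target's size and a reduction of attention on background regions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `sigma_factor` parameter, the linear size-to-sigma rule, and the "outside mass" quantity are all assumptions made for illustration, since the abstract does not give the exact formulas.

```python
import numpy as np


def gaussian_heatmap(grid_h, grid_w, box, sigma_factor=0.5):
    """Return a (grid_h, grid_w) weight map that peaks at the box center.

    box is (x0, y0, x1, y1) in grid-cell coordinates.
    sigma_factor is a hypothetical knob: sigma = sigma_factor * box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Assumption: the standard deviation grows linearly with width and height.
    sx = max(sigma_factor * (x1 - x0), 1e-6)
    sy = max(sigma_factor * (y1 - y0), 1e-6)
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    # Evaluate the Gaussian at cell centers rather than cell corners.
    xs = xs + 0.5
    ys = ys + 0.5
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))


def outside_mass(attn, box):
    """Fraction of (normalized) attention that falls outside the box.

    Assumption: a suppression-style objective could penalize this value.
    """
    x0, y0, x1, y1 = box
    p = attn / attn.sum()
    inside = p[int(y0):int(np.ceil(y1)), int(x0):int(np.ceil(x1))].sum()
    return 1.0 - inside


if __name__ == "__main__":
    box = (2, 2, 6, 6)
    heat = gaussian_heatmap(8, 8, box)
    print(np.round(heat, 2))
    print("outside mass, uniform attention:", outside_mass(np.ones((8, 8)), box))
    print("outside mass, Gaussian attention:", outside_mass(heat, box))
```

In this sketch a smaller box yields a narrower peak, and attention shaped like the heatmap places less mass outside the box than uniform attention does.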
