Fugu-MT Paper Translation (Summary): AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

Paper summary: AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

arxiv url: http://arxiv.org/abs/2603.17441v1
Date: Wed, 18 Mar 2026 07:26:18 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.568244
Title: AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
Authors: Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu
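Abstract summary: We propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. The approach introduces an instruction refinement module that rewrites natural language commands into explicit, detailed descriptions. In addition, a conditional zoom-in strategy selectively performs a second-stage inference on elements predicted to be small.
Reference score (independently computed measure of attention): 44.11867590785016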
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
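The conditional zoom-in strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Grounder` and `refine` interfaces, the normalized-coordinate convention, and the `small_area` and `window` values are all assumptions made for illustration, since the abstract does not specify them. The sketch shows only the control flow: refine the instruction, run one pass on the full screenshot, and run a second pass on a cropped window only when the predicted bounding box is small, then map the result back to full-screenshot coordinates.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (left, top, right, bottom), normalized to [0, 1] relative to some region.
Box = Tuple[float, float, float, float]
FULL: Box = (0.0, 0.0, 1.0, 1.0)


@dataclass(frozen=True)
class Prediction:
    click: Tuple[float, float]  # (x, y), normalized to the region that was queried
    box: Box                    # element bounding box, same convention


# Hypothetical model interface (an assumption, not from the paper):
# given a region of the screenshot and an instruction, return a prediction
# expressed in coordinates relative to that region.
Grounder = Callable[[Box, str], Prediction]


def area(box: Box) -> float:
    """Fraction of the queried region covered by the box."""
    left, top, right, bottom = box
    return max(0.0, right - left) * max(0.0, bottom - top)


def zoom_region(box: Box, window: float) -> Box:
    """Square window of side `window` centered on the box, shifted to stay on screen."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2, (top + bottom) / 2
    x0 = min(max(cx - window / 2, 0.0), 1.0 - window)
    y0 = min(max(cy - window / 2, 0.0), 1.0 - window)
    return (x0, y0, x0 + window, y0 + window)


def to_full(pred: Prediction, region: Box) -> Prediction:
    """Map a region-relative prediction back to full-screenshot coordinates."""
    left, top, right, bottom = region
    w, h = right - left, bottom - top
    x, y = pred.click
    bl, bt, br, bb = pred.box
    return Prediction(
        click=(left + x * w, top + y * h),
        box=(left + bl * w, top + bt * h, left + br * w, top + bb * h),
    )


def adaptive_ground(
    ground: Grounder,
    instruction: str,
    refine: Callable[[str], str] = lambda s: s,  # placeholder for instruction refinement
    small_area: float = 0.001,  # illustrative threshold, not from the paper
    window: float = 0.25,       # illustrative crop size, not from the paper
) -> Prediction:
    refined = refine(instruction)
    first = ground(FULL, refined)
    if area(first.box) >= small_area:
        # Large enough element: skip the second pass, keeping full context.
        return first
    region = zoom_region(first.box, window)
    second = ground(region, refined)
    return to_full(second, region)
```

Skipping the second pass for large elements is what makes the zoom conditional: easy cases pay for one inference only and keep the full-screen context, while small targets get a higher effective resolution.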
