Fugu-MT Paper Translation (Abstract): Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

Paper overview: Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

arxiv url: http://arxiv.org/abs/2603.14448v1
Date: Sun, 15 Mar 2026 15:47:47 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.812146
Title: Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
Title (reference translation): Zoom to Essence: Trainless GUI Grounding by Inferring Interface Elements
Authors: Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo
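Abstract summary: Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly. Existing GUI agents typically fine-tune MLLMs on massive datasets to handle the challenges of understanding instructions and UI interfaces. This paper proposes ZoomUI, which leverages inference scaling to guide common MLLMs in progressively anchoring instruction elements to increasingly detailed interface elements.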
Reference score (independently computed attention score): 40.21437107734778
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly, with visual grounding, which maps natural language instructions to target UI elements, serving as the core capability. Existing GUI agents typically fine-tune MLLMs on massive datasets to handle challenges in understanding instructions and UI interfaces, which not only incurs high data annotation costs but also makes performance dependent on data quality and distribution. To avoid such cumbersome yet ineffective training, we observe that complex UI interfaces can be decomposed into basic visual elements that common MLLMs understand directly. Consequently, we propose ZoomUI, which leverages inference scaling to guide common MLLMs in progressively anchoring instruction elements to increasingly detailed interface elements. Specifically, ZoomUI first optimizes the latent thinking to transform the original instruction into a description of the element's visual features, and subsequently leverages internal attention to iteratively zoom in on the interface region of the target element. Evaluations on extensive benchmarks demonstrate that ZoomUI matches or even surpasses SOTA baselines.
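Abstract (reference translation): Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly, with visual grounding, which maps natural language instructions to target UI elements, serving as the core capability. Existing GUI agents fine-tune MLLMs on massive datasets to handle the challenges of understanding instructions and UI interfaces. To avoid such cumbersome and inefficient training, we observe that complex UI interfaces can be decomposed into basic visual elements that common MLLMs understand directly. We therefore propose ZoomUI, which leverages inference scaling to guide common MLLMs in progressively anchoring instruction elements to increasingly detailed interface elements. Specifically, ZoomUI first optimizes latent thinking to transform the original instruction into a description of the element's visual features. Evaluations on extensive benchmarks show that ZoomUI matches or surpasses SOTA baselines.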
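
The iterative zoom-in idea mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the abstract does not say how attention is extracted or how crops are chosen, so everything below is an assumption made for illustration. The names `zoom_to_target`, `score_region`, `make_toy_scorer`, `shrink`, and `min_size` are hypothetical, and `score_region` is only a stand-in for whatever relevance signal an MLLM might provide over the current crop.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in screen pixels
Heatmap = List[List[float]]       # rows of relevance scores over the current crop


def zoom_to_target(
    width: int,
    height: int,
    score_region: Callable[[Box], Heatmap],
    steps: int = 3,
    shrink: float = 0.5,
    min_size: int = 32,
) -> Tuple[Box, Tuple[int, int]]:
    """Repeatedly crop around the highest-scoring cell of a relevance heatmap.

    Returns the final crop and its centre, both in original screen coordinates.
    """
    box: Box = (0, 0, width, height)
    for _ in range(steps):
        left, top, right, bottom = box
        w, h = right - left, bottom - top
        if w <= min_size or h <= min_size:
            break  # crop is already small enough
        heat = score_region(box)  # hypothetical stand-in for model attention
        rows, cols = len(heat), len(heat[0])
        # Find the grid cell with the highest relevance score.
        r, c = max(
            ((i, j) for i in range(rows) for j in range(cols)),
            key=lambda ij: heat[ij[0]][ij[1]],
        )
        # Centre of that cell, mapped back to screen coordinates.
        cx = left + (c + 0.5) * w / cols
        cy = top + (r + 0.5) * h / rows
        new_w = max(min_size, int(w * shrink))
        new_h = max(min_size, int(h * shrink))
        # Clamp the new crop so it stays inside the current one.
        new_left = int(min(max(cx - new_w / 2, left), right - new_w))
        new_top = int(min(max(cy - new_h / 2, top), bottom - new_h))
        box = (new_left, new_top, new_left + new_w, new_top + new_h)
    left, top, right, bottom = box
    return box, ((left + right) // 2, (top + bottom) // 2)


def make_toy_scorer(target: Tuple[int, int], grid: int = 4) -> Callable[[Box], Heatmap]:
    """Toy scorer: cells closer to a known target point score higher."""
    tx, ty = target

    def score(box: Box) -> Heatmap:
        left, top, right, bottom = box
        cw, ch = (right - left) / grid, (bottom - top) / grid
        return [
            [
                -((left + (j + 0.5) * cw - tx) ** 2 + (top + (i + 0.5) * ch - ty) ** 2)
                for j in range(grid)
            ]
            for i in range(grid)
        ]

    return score


if __name__ == "__main__":
    box, click = zoom_to_target(1024, 768, make_toy_scorer((900, 100)))
    print("final crop:", box, "click point:", click)
```

In a real setting the toy scorer would be replaced by a relevance map derived from the model, and the crop at each step would be re-encoded at full resolution so that small interface elements become legible. Clamping each new crop inside the previous one keeps every step's coordinates valid on the original screenshot.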
