Fugu-MT Paper Translation (Summary): UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Paper summary: UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

arxiv url: http://arxiv.org/abs/2510.20286v1
Date: Thu, 23 Oct 2025 07:18:32 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:17.522358
Title: UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Authors: Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
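Abstract summary: This paper introduces the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways. To realize it, the authors propose a two-stage training framework: supervised fine-tuning followed by reinforcement learning. The resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks.
Reference score (site-computed attention score): 51.54456545661045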
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives and enables the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
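Abstract (reference translation): GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work has largely treated instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. A careful investigation of existing grounding datasets finds a 23.3% flaw rate in their instructions and shows that exploiting instruction diversity at inference time yields a relative performance improvement of up to 76%. The paper introduces the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives and lets the model select the most effective pathway during reasoning. To this end, it proposes a two-stage training framework: supervised fine-tuning on synthesized instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. The resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy: 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. The model also achieves a 74.1% success rate on AndroidWorld with UI-Ins-7B as the executor. The proposed method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be released at https://github.com/alibaba/UI-Ins.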
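
The abstract reports grounding accuracy and a relative performance improvement without defining either. The sketch below assumes a common convention for GUI grounding, in which a prediction counts as correct when the predicted click point falls inside the target element's bounding box. It is not taken from the paper. The names `Box`, `grounding_accuracy`, and `relative_improvement` are hypothetical, and the numbers in the usage note are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border is treated as inside the element.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted (x, y) click points that land inside their target box.

    Assumed metric: point-in-box accuracy. The paper may define it differently.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, e.g. 0.76 means +76%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline
```

With made-up accuracies, `relative_improvement(0.5, 0.88)` is approximately 0.76: an absolute gain of 38 points corresponds to a 76% relative gain. This is how a figure such as "up to 76% relative improvement" should be read, as opposed to a 76-point absolute increase.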

Related papers