Fugu-MT Paper Translation (Abstract): GTA1: GUI Test-time Scaling Agent

Paper overview: GTA1: GUI Test-time Scaling Agent

arxiv url: http://arxiv.org/abs/2507.05791v3
Date: Thu, 10 Jul 2025 01:10:25 GMT
Status: Translation complete
Last updated in system: 2025-07-11 12:24:00.078413
Title: GTA1: GUI Test-time Scaling Agent
Title (reference translation): GTA1: GUI Test-time Scaling Agent
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
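Abstract summary: This paper investigates two challenges with GTA1, a GUI test-time scaling agent. First, to select the most appropriate action proposal, it introduces a test-time scaling method. Second, it proposes a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements.
Reference score (independently computed attention score): 77.60727242084971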
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here.
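Abstract (reference translation): Graphical user interface (GUI) agents operate autonomously across platforms (e.g., Linux), completing tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to one interaction with the GUI. After each action, the agent observes the updated GUI environment and plans the next step. Two main challenges arise, however: (i) resolving ambiguity in task planning (i.e., in the sequence of action proposals), where choosing an appropriate plan is non-trivial because many valid plans may exist; and (ii) accurately grounding actions in complex, high-resolution interfaces, i.e., interacting precisely with visual targets. This paper investigates both challenges with GTA1, a GUI test-time scaling agent. First, to select the most appropriate action proposal, we introduce a test-time scaling method: at each step, multiple candidate action proposals are sampled, and a judge model evaluates them and selects the most suitable one. Through concurrent sampling, this trades additional computation for better decision quality, shortening task execution and improving overall performance. Second, we propose a model that grounds the selected action proposal to its corresponding visual element with improved accuracy. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignment, since it rewards successful clicks on interface elements. Experimentally, the method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves accuracies of 50.1%, 92.4%, and 67.7% on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner that applies the test-time scaling strategy, it shows state-of-the-art agentic performance (e.g., a 45.2% task success rate on OSWorld). We open-source our code and models.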
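
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `select_action`, `propose`, `judge`, and `click_reward` are hypothetical, the judge is assumed to return the index of its preferred candidate, and the reward is assumed to be a simple binary check of whether a click lands inside the target element's bounding box. Sampling is done sequentially here for brevity, whereas the abstract describes concurrent sampling.

```python
from typing import Callable, List, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def select_action(
    propose: Callable[[], str],
    judge: Callable[[List[str]], int],
    num_samples: int = 8,
) -> str:
    """Sample several candidate action proposals and return the judge's pick.

    `propose` stands in for a planner call and `judge` for a judge model;
    both interfaces are assumptions made for this sketch.
    """
    # Draw the candidates one after another (a real system could do this concurrently).
    candidates = [propose() for _ in range(num_samples)]
    # The judge sees every candidate and returns the index of the one it prefers.
    best_index = judge(candidates)
    return candidates[best_index]


def click_reward(click: Tuple[float, float], target: Box) -> float:
    """Return 1.0 if the click point lies inside the target box, else 0.0."""
    x, y = click
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


if __name__ == "__main__":
    # Toy planner that yields three fixed proposals in order.
    proposals = iter(["click 'Cancel'", "click 'Save'", "scroll down"])

    # Toy judge that prefers any proposal mentioning "Save".
    def toy_judge(cands: List[str]) -> int:
        return max(range(len(cands)), key=lambda i: "Save" in cands[i])

    print(select_action(lambda: next(proposals), toy_judge, num_samples=3))
    print(click_reward((50.0, 20.0), (40.0, 10.0, 120.0, 30.0)))
```

Raising `num_samples` spends more computation per step in exchange for a wider pool of candidates for the judge, which is the trade-off the abstract refers to.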

Related papers