GTA1: GUI Test-time Scaling Agent
- URL: http://arxiv.org/abs/2507.05791v3
- Date: Thu, 10 Jul 2025 01:10:25 GMT
- Title: GTA1: GUI Test-time Scaling Agent
- Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
- Abstract summary: This paper investigates two main challenges of GUI agents with our GUI Test-time Scaling Agent, GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements.
- Score: 77.60727242084971
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here.
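Below is a minimal sketch of the two ideas in the abstract: judge-based test-time scaling over sampled action proposals, and a binary click reward for RL-based grounding. The helper callables (`sample_proposals`, `judge_score`) and the exact reward form are assumptions for illustration, not the authors' released API.

```python
# Sketch only: names and signatures here are hypothetical, not from the GTA1 codebase.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionProposal:
    description: str              # natural-language action, e.g. "click the Save button"
    coordinates: Tuple[int, int]  # grounded (x, y) click target on the screenshot


def select_action(
    sample_proposals: Callable[[str, bytes, int], List[ActionProposal]],
    judge_score: Callable[[str, bytes, ActionProposal], float],
    instruction: str,
    screenshot: bytes,
    num_candidates: int = 8,
) -> ActionProposal:
    """Test-time scaling step: sample several candidate proposals for the current
    GUI state, then let a judge model score them and keep the most suitable one."""
    candidates = sample_proposals(instruction, screenshot, num_candidates)
    return max(candidates, key=lambda p: judge_score(instruction, screenshot, p))


def click_reward(pred_xy: Tuple[int, int], target_box: Tuple[int, int, int, int]) -> float:
    """Assumed binary grounding reward: 1.0 if the predicted click lands inside the
    target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
```

Because candidates can be sampled and judged independently, the extra computation parallelizes across proposals, which is the sense in which the abstract trades compute for better per-step decision quality.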
Related papers
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of 28+% on Screenspot-Pro and 24+% on WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning [8.363126388041408]
We introduce VisiDroid, a multi-modal, multi-agent framework that iteratively determines the next action and leverages visual images of screens to detect task completion. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, outperforming the best baseline by 23.5% (relative).
arXiv Detail & Related papers (2025-04-22T14:02:57Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark, AITW, with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z)
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely Glance and Gaze Network (GGNet).
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
arXiv Detail & Related papers (2021-04-12T08:01:04Z)