Learning GUI Grounding with Spatial Reasoning from Visual Feedback
- URL: http://arxiv.org/abs/2509.21552v1
- Date: Thu, 25 Sep 2025 20:38:01 GMT
- Title: Learning GUI Grounding with Spatial Reasoning from Visual Feedback
- Authors: Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim
- Abstract summary: We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results.
- Score: 46.66862168972301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task: given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 (88.8% → 93.9%) and ScreenSpot-Pro (26.8% → 56.5%). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95% of instances and can adaptively conduct more steps on more difficult examples.
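The interactive search loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `make_toy_policy` is a hypothetical stand-in for the VLM (it simply moves the cursor halfway toward the target), and `trajectory_reward` is an invented example of a dense, trajectory-level reward, since the abstract does not specify the actual reward function.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); hypothetical target format


def inside(p: Point, box: Box) -> bool:
    """Return True if point p lies within the bounding box."""
    x, y = p
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def center(box: Box) -> Point:
    return ((box[0] + box[2]) // 2, (box[1] + box[3]) // 2)


def make_toy_policy(target: Box) -> Callable[[List[Point]], Tuple[Point, bool]]:
    """Hypothetical stand-in for the VLM: move halfway toward the target.

    A real model would instead look at the screenshot with the cursor
    rendered on it and decide the next move from that visual feedback.
    """
    cx, cy = center(target)

    def policy(history: List[Point]) -> Tuple[Point, bool]:
        x, y = history[-1]
        if inside((x, y), target):
            return (x, y), True  # cursor is on the element: stop
        return ((x + cx) // 2, (y + cy) // 2), False

    return policy


def cursor_search(policy, start: Point, max_steps: int = 8) -> List[Point]:
    """Run the move-observe-move loop and return the cursor trajectory."""
    history = [start]
    for _ in range(max_steps):
        position, stop = policy(history)  # conditioned on the movement history
        if stop:
            break
        history.append(position)
    return history


def trajectory_reward(trajectory: List[Point], target: Box) -> float:
    """Illustrative dense reward (an assumption, not the paper's formula):
    1.0 for ending on the target, plus up to 0.1 for moves that approach it."""
    cx, cy = center(target)

    def dist(p: Point) -> float:
        return ((p[0] - cx) ** 2 + (p[1] - cy) ** 2) ** 0.5

    moves = list(zip(trajectory, trajectory[1:]))
    closer = sum(1 for a, b in moves if dist(b) < dist(a))
    progress = closer / len(moves) if moves else 0.0
    hit = 1.0 if inside(trajectory[-1], target) else 0.0
    return hit + 0.1 * progress


target_box = (90, 40, 110, 60)
path = cursor_search(make_toy_policy(target_box), start=(0, 0))
reward = trajectory_reward(path, target_box)
```

In this toy run the cursor starts at the top-left corner and reaches the target after a handful of moves; the point is only the structure of the loop (act, observe the new cursor position, act again) and the idea of scoring the whole trajectory rather than a single predicted coordinate.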
Related papers
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents. In this paper, we investigate zoom as a strong yet underexplored prior to GUI grounding, and propose a training-free method, ZoomClick. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
arXiv Detail & Related papers (2025-12-05T18:39:12Z)
- GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding [44.598660921968595]
We propose an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G and 91.5% on ScreenSpot-v2.
arXiv Detail & Related papers (2025-11-02T05:34:21Z)
- GUI-Spotlight: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding [37.69847052653875]
We introduce GUI-Spotlight, a model trained for image-grounded reasoning. It iteratively narrows its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy.
arXiv Detail & Related papers (2025-10-05T05:15:45Z)
- Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only 2.0% and 3.7% accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of 35.7%.
arXiv Detail & Related papers (2025-09-29T00:06:31Z)
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [17.954613936413942]
We propose a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions. We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
arXiv Detail & Related papers (2025-08-07T17:54:27Z)
- GTA1: GUI Test-time Scaling Agent [77.60727242084971]
This paper investigates the two main challenges with our GUI Test-time Scaling Agent, GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements.
arXiv Detail & Related papers (2025-07-08T08:52:18Z)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning [21.964100514016504]
Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. We propose GUI-Shift, a reinforcement learning framework that combines rule-based optimization with data filtering to improve VLM performance.
arXiv Detail & Related papers (2025-05-18T16:34:30Z)
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of 28+% on ScreenSpot-Pro and 24+% on WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- ScaleTrack: Scaling and back-tracking Automated GUI Agents [11.046190201201348]
We propose ScaleTrack, a training framework that scales grounding and backtracking planning for automated GUI agents. We collect GUI samples of different synthesis criteria from a wide range of sources, and unify them into the same template for training GUI grounding models. We design a novel training strategy that predicts the next action from the current GUI image, while also backtracking the historical actions that led to the GUI image.
arXiv Detail & Related papers (2025-05-01T09:27:13Z)
- GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration [56.58744345634623]
We propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration. We also introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments.
arXiv Detail & Related papers (2025-01-23T18:16:21Z)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
arXiv Detail & Related papers (2024-01-17T08:10:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.