Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding
- URL: http://arxiv.org/abs/2509.24133v1
- Date: Mon, 29 Sep 2025 00:06:31 GMT
- Title: Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding
- Authors: Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
- Abstract summary: GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$ and $3.7\%$ accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of $35.7\%$.
- Score: 53.14935624161711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a 'Scanner' to identify potential regions of interest, while the fine-tuned grounding model serves as a 'Locator' that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$ and $3.7\%$ accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of $35.7\%$, representing a $10\times$ improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.
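To make the division of labor concrete, here is a minimal sketch of the coarse-to-fine idea: a generalist 'Scanner' proposes candidate regions and a specialist 'Locator' predicts coordinates inside each cropped region. The `scanner.propose_regions` and `locator.locate` interfaces are hypothetical stand-ins; the paper's actual five-stage pipeline and cross-modal communication are not reproduced here.

```python
# Sketch of the coarse-to-fine Scanner/Locator idea (interfaces assumed).
from dataclasses import dataclass

@dataclass
class Region:
    x0: int
    y0: int
    x1: int
    y1: int

def ground(screenshot, query, scanner, locator, max_regions=3):
    """Coarse-to-fine grounding: scan for candidate regions, then localize."""
    # Coarse stage: the generalist VLM proposes regions of interest.
    regions = scanner.propose_regions(screenshot, query)[:max_regions]
    best, best_score = None, float("-inf")
    for r in regions:
        # Crop the screenshot to the proposed region (PIL-style crop box).
        crop = screenshot.crop((r.x0, r.y0, r.x1, r.y1))
        # Fine stage: the specialist predicts coordinates within the crop.
        (cx, cy), score = locator.locate(crop, query)
        if score > best_score:
            # Map crop-local coordinates back to full-screen coordinates.
            best, best_score = (r.x0 + cx, r.y0 + cy), score
    return best
```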
Related papers
- VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks [20.241892335459497]
We present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding. VenusBench-GD spans multiple platforms, enabling hierarchical evaluation for real-world applications.
arXiv Detail & Related papers (2025-12-18T13:09:09Z)
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
arXiv Detail & Related papers (2025-12-05T18:39:12Z)
- MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements [7.2364254826655925]
MEGA-GUI is a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results.
arXiv Detail & Related papers (2025-11-17T07:38:05Z)
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [17.954613936413942]
We propose a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions. We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
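As a rough illustration of the voting idea, the sketch below bins multiple sampled click predictions into a spatial grid and returns the center of the most-voted cell; the grid size and consensus rule are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a spatial voting grid over sampled click predictions.
import numpy as np

def consensus_point(points, width, height, grid=32):
    """Accumulate sampled (x, y) predictions into a grid and return the
    center of the most-voted cell as the consensus prediction."""
    votes = np.zeros((grid, grid), dtype=int)
    for x, y in points:
        gx = min(int(x / width * grid), grid - 1)
        gy = min(int(y / height * grid), grid - 1)
        votes[gy, gx] += 1
    # Winning cell = region where the sampled predictions agree most.
    gy, gx = np.unravel_index(votes.argmax(), votes.shape)
    # Map the winning cell back to pixel space (cell center).
    return ((gx + 0.5) * width / grid, (gy + 0.5) * height / grid)
```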
arXiv Detail & Related papers (2025-08-07T17:54:27Z)
- GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding [51.497245303008015]
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards. We show that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro.
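The Gaussian reward idea can be sketched in a few lines: score a predicted click by a Gaussian centered on the target element, with the spread tied to element size so larger targets are more forgiving. Tying the spread to the box dimensions is an assumption for illustration, not the paper's exact reward definition.

```python
# Hedged sketch of a Gaussian point reward centered on the target element.
import math

def gaussian_reward(pred_x, pred_y, box):
    """Reward a predicted click by a Gaussian centered on the target box."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Scale the spread with element size so large targets are forgiving
    # (an assumption, not the paper's exact parameterization).
    sx, sy = max((x1 - x0) / 2, 1.0), max((y1 - y0) / 2, 1.0)
    return math.exp(-(((pred_x - cx) / sx) ** 2 + ((pred_y - cy) / sy) ** 2) / 2)
```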
arXiv Detail & Related papers (2025-07-21T17:53:42Z)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z)
- DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [53.42606072841585]
We introduce DiMo-GUI, a training-free framework for GUI grounding. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
arXiv Detail & Related papers (2025-06-12T03:13:21Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
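A minimal sketch of an attention-style action head, assuming a single dedicated query token attending over patch embeddings; the shapes and token handling below are illustrative assumptions, not GUI-Actor's actual architecture.

```python
# Illustrative attention head: one query token scored against patch tokens.
import torch

def attention_action_head(actor_token, patch_tokens):
    """actor_token: (d,) embedding of the dedicated action-query token.
    patch_tokens: (num_patches, d) visual patch embeddings.
    Returns attention weights over patches and the most-attended index."""
    d = actor_token.shape[-1]
    # Scaled dot-product scores between the query token and each patch.
    scores = patch_tokens @ actor_token / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # The most-attended patch serves as the (coordinate-free) action target.
    return weights, int(weights.argmax())
```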
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis [59.83524388782554]
Graphical user interface (GUI) grounding remains a critical bottleneck in computer-use agent development. We introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types. We synthesize and release the largest computer-use grounding dataset, Jedi, which contains 4 million examples.
arXiv Detail & Related papers (2025-05-19T15:09:23Z)
- Improved GUI Grounding via Iterative Narrowing [0.03375622857152329]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
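The narrowing loop can be sketched as follows: predict a point, crop a smaller view centered on it, and repeat on the zoomed image. The `model.predict_point` interface, the PIL-style image methods, and the fixed shrink schedule are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch of iterative narrowing via repeated centered crops.
def iterative_narrowing(image, query, model, steps=3, shrink=0.5):
    ox, oy = 0, 0  # offset of the current crop within the original image
    view = image
    for _ in range(steps):
        x, y = model.predict_point(view, query)  # point in view coordinates
        w, h = view.size
        nw, nh = int(w * shrink), int(h * shrink)
        # Center the next, smaller crop on the current prediction,
        # clamped so the crop stays inside the view.
        x0 = min(max(int(x - nw / 2), 0), w - nw)
        y0 = min(max(int(y - nh / 2), 0), h - nh)
        view = view.crop((x0, y0, x0 + nw, y0 + nh))
        ox, oy = ox + x0, oy + y0
    # Final prediction on the most zoomed-in view, mapped back to the
    # original image's coordinate frame.
    x, y = model.predict_point(view, query)
    return ox + x, oy + y
```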
arXiv Detail & Related papers (2024-11-18T05:47:12Z)