\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
- URL: http://arxiv.org/abs/2510.04039v1
- Date: Sun, 05 Oct 2025 05:15:45 GMT
- Title: \textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding
- Authors: Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding
- Abstract summary: We introduce GUI-Spotlight, a model trained for image-grounded reasoning. It iteratively narrows its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8% accuracy.
- Score: 37.69847052653875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. This limitation prevents the system from accurately performing pointer-level actions such as clicking or dragging. To address it, we introduce GUI-Spotlight -- a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight trained with only 18.5K training samples achieves 52.8\% accuracy, surpassing V2P-7B (50.6\% with 9.6M training samples) and GTA-1-7B (50.1\% with 1.56M training samples).
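The core loop described in the abstract, predict a location, then crop to that neighborhood and ground again on the enlarged view, can be pictured with the minimal sketch below. This is not the released GUI-Spotlight implementation: the `GroundingModel` interface, the fixed zoom ratio, and the fixed number of refinement steps are illustrative assumptions standing in for the paper's dynamic tool-calling policy.

```python
# Minimal sketch of an iterative focus-refinement loop for GUI grounding.
# Hypothetical: `GroundingModel.predict_point` and the crop schedule are
# illustrative assumptions, not the GUI-Spotlight implementation.
from dataclasses import dataclass
from PIL import Image

@dataclass
class Point:
    x: float  # absolute pixel coordinates in the full screenshot
    y: float

class GroundingModel:
    def predict_point(self, image: Image.Image, instruction: str) -> Point:
        """Return a click point in the coordinate frame of `image` (stub)."""
        raise NotImplementedError

def iterative_ground(model: GroundingModel, screenshot: Image.Image,
                     instruction: str, steps: int = 3, zoom: float = 0.5) -> Point:
    """Repeatedly predict a point, then crop around it and re-predict.

    Each round shrinks the visible region by `zoom`, so later predictions are
    made on a higher-resolution view of the relevant area; coordinates are
    mapped back to the full screenshot after every crop.
    """
    left, top = 0.0, 0.0          # offset of the current crop in the full image
    view = screenshot
    for _ in range(steps):
        p = model.predict_point(view, instruction)
        gx, gy = left + p.x, top + p.y              # back to global coordinates
        w, h = view.size
        cw, ch = max(1, int(w * zoom)), max(1, int(h * zoom))
        # New, smaller crop centered on the prediction, clamped to the image bounds.
        left = min(max(gx - cw / 2, 0), screenshot.width - cw)
        top = min(max(gy - ch / 2, 0), screenshot.height - ch)
        view = screenshot.crop((int(left), int(top), int(left) + cw, int(top) + ch))
    return Point(gx, gy)
```

In the paper the model itself decides which specialized tool to invoke and when to stop; the fixed schedule above only makes the coordinate bookkeeping of iterative cropping explicit.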
Related papers
- POINTS-GUI-G: GUI-Grounding Journey [22.35782799756431]
We introduce POINTS-GUIG-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UIVision. Our model's success is driven by three key factors: (1) refined data engineering; (2) improved training strategies; and (3) reinforcement learning with verifiable rewards.
arXiv Detail & Related papers (2026-02-06T05:14:11Z)
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
arXiv Detail & Related papers (2025-12-05T18:39:12Z)
- GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding [44.598660921968595]
We propose an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G, and 91.5% on ScreenSpot-v2.
arXiv Detail & Related papers (2025-11-02T05:34:21Z)
- Learning GUI Grounding with Spatial Reasoning from Visual Feedback [46.66862168972301]
We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function (a hedged sketch of one possible such reward follows this entry). Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves GUI grounding accuracy and achieves state-of-the-art results.
arXiv Detail & Related papers (2025-09-25T20:38:01Z)
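The GUI-Cursor entry above mentions a dense trajectory-based reward; one way such a reward could look is sketched below. The distance-shaped term, the in-box bonus, and the discounting of earlier steps are illustrative assumptions only, not the reward actually used to train GUI-Cursor.

```python
# Hedged sketch of a dense, trajectory-based reward for grounding RL.
# The shaping terms and weights are illustrative assumptions, not the
# reward used by GUI-Cursor.
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (left, top, right, bottom)
Click = Tuple[float, float]               # (x, y) predicted click

def step_reward(click: Click, target: Box, diag: float) -> float:
    """Dense reward for one step: 1.0 inside the target box, otherwise a
    value in [0, 1) that decays with distance to the box center (normalized
    by the screenshot diagonal `diag`)."""
    x, y = click
    l, t, r, b = target
    if l <= x <= r and t <= y <= b:
        return 1.0
    cx, cy = (l + r) / 2, (t + b) / 2
    dist = math.hypot(x - cx, y - cy) / max(diag, 1e-6)
    return max(0.0, 1.0 - dist)

def trajectory_reward(clicks: List[Click], target: Box, diag: float,
                      gamma: float = 0.9) -> float:
    """Aggregate per-step rewards over a multi-step episode, weighting later
    (more refined) predictions more heavily."""
    if not clicks:
        return 0.0
    weights = [gamma ** (len(clicks) - 1 - i) for i in range(len(clicks))]
    total = sum(w * step_reward(c, target, diag) for w, c in zip(weights, clicks))
    return total / sum(weights)
```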
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [17.954613936413942]
We propose a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions (a sketch of this voting idea follows this entry). We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
arXiv Detail & Related papers (2025-08-07T17:54:27Z)
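The spatial voting idea from the GUI-RCPO entry above can be pictured as follows: sample several click predictions, accumulate them on a coarse grid over the screenshot, and take the densest cell as the consensus region. The grid resolution, the `sample_click` interface, and the tie-breaking are illustrative assumptions, not the GUI-RCPO implementation.

```python
# Hedged sketch of spatial voting over multiple sampled click predictions.
# Grid resolution and the `sample_click` callable are illustrative assumptions.
from collections import Counter
from typing import Callable, Tuple

Click = Tuple[float, float]

def consensus_region(sample_click: Callable[[], Click],
                     width: int, height: int,
                     n_samples: int = 8, grid: int = 16) -> Tuple[int, int, int, int]:
    """Sample `n_samples` click predictions, vote them into a `grid` x `grid`
    lattice over the screenshot, and return the winning cell as a pixel box."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        x, y = sample_click()
        cx = min(int(x / width * grid), grid - 1)
        cy = min(int(y / height * grid), grid - 1)
        votes[(cx, cy)] += 1
    (cx, cy), _ = votes.most_common(1)[0]   # densest cell wins (ties broken arbitrarily)
    cell_w, cell_h = width // grid, height // grid
    return (cx * cell_w, cy * cell_h, (cx + 1) * cell_w, (cy + 1) * cell_h)
```

The consensus cell could then serve either as the answer region or, as GUI-RCPO does with its consistency patterns, as a signal for rewarding samples during test-time reinforcement learning.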
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens (a rough attention-to-point sketch follows this entry). Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
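As a rough illustration of the coordinate-free, attention-based readout mentioned in the GUI-Actor entry above, the sketch below converts per-patch attention weights into a click point by taking the attention-weighted centroid of patch centers. The patch grid, the normalization, and the centroid decoding are assumptions made for illustration; they are not the GUI-Actor (or GUI-AIMA) heads.

```python
# Hedged sketch: turn per-patch attention weights into a click point.
# The normalization and weighted-centroid decoding are illustrative
# assumptions, not the GUI-Actor / GUI-AIMA implementation.
import numpy as np

def attention_to_point(attn: np.ndarray, image_w: int, image_h: int) -> tuple:
    """`attn` holds one non-negative score per patch on a (rows, cols) grid.
    Returns the attention-weighted centroid of patch centers, in pixels."""
    rows, cols = attn.shape
    probs = attn / max(attn.sum(), 1e-8)          # normalize to a distribution
    ys, xs = np.mgrid[0:rows, 0:cols]
    # Center of each patch in pixel coordinates.
    px = (xs + 0.5) * (image_w / cols)
    py = (ys + 0.5) * (image_h / rows)
    return float((probs * px).sum()), float((probs * py).sum())

# Example: a sharp peak at patch (row=3, col=10) on a 14x14 grid.
demo = np.zeros((14, 14)); demo[3, 10] = 1.0
print(attention_to_point(demo, image_w=1400, image_h=1400))  # ~ (1050.0, 350.0)
```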
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for vision-language model agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of over 28% on ScreenSpot-Pro and over 24% on WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)