How Auxiliary Reasoning Unleashes GUI Grounding in VLMs
- URL: http://arxiv.org/abs/2509.11548v1
- Date: Mon, 15 Sep 2025 03:28:29 GMT
- Title: How Auxiliary Reasoning Unleashes GUI Grounding in VLMs
- Authors: Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Manni Duan
- Abstract summary: General vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We propose three zero-shot auxiliary reasoning methods to address this discrepancy. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs.
- Score: 16.798199078199154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
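The core idea of the paper, overlaying explicit spatial cues (axes, grids, labeled intersections) on the screenshot before querying the VLM, can be sketched in a few lines. The function below is our own illustrative code, not the authors' implementation: the grid step, colors, and function name are assumptions.

```python
# Illustrative sketch: annotate a screenshot with grid lines and labeled
# intersections so a VLM can read off coordinates instead of estimating them.
from PIL import Image, ImageDraw

def add_grid_cues(img: Image.Image, step: int = 100) -> Image.Image:
    """Return a copy of `img` overlaid with a labeled coordinate grid."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, step):            # vertical lines + x labels
        draw.line([(x, 0), (x, h)], fill="red", width=1)
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, h, step):            # horizontal lines + y labels
        draw.line([(0, y), (w, y)], fill="red", width=1)
        draw.text((2, y + 2), str(y), fill="red")
    # label each intersection so the model can anchor explicit coordinates
    for x in range(0, w, step):
        for y in range(0, h, step):
            draw.text((x + 2, y + 12), f"({x},{y})", fill="blue")
    return out
```

A VLM prompted with the annotated image can then answer with the nearest labeled intersection rather than raw pixel estimates, which is how such cues let the model articulate its implicit spatial understanding.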
Related papers
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only 2.0% and 3.7% accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of 35.7%.
arXiv Detail & Related papers (2025-09-29T00:06:31Z) - Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding [31.57375084036447]
Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. We propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities. Our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data.
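The IoU-based region quality evaluation mentioned above reduces, at its core, to the standard Intersection-over-Union formula for two boxes. The sketch below shows only that standard computation (boxes as `(x1, y1, x2, y2)` tuples); it is not LASER's actual scoring code.

```python
# Standard Intersection-over-Union for axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```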
arXiv Detail & Related papers (2025-09-04T14:17:01Z) - R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z) - DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [53.42606072841585]
We introduce DiMo-GUI, a training-free framework for GUI grounding. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
arXiv Detail & Related papers (2025-06-12T03:13:21Z) - GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z) - ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search [53.40810298627443]
ReGUIDE is a framework for web grounding that enables MLLMs to learn data-efficiently through self-generated reasoning and spatial-aware criticism. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-21T08:36:18Z) - Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
arXiv Detail & Related papers (2025-04-22T17:52:42Z) - Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks [20.31857138247549]
Perception-enhanced pre-training is widely adopted to improve the performance of graphical user interface (GUI) agents. We propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. We show that query inference achieves performance comparable to, or even better than, the large-scale grounding-enhanced OS-Atlas with less than 0.1% of the training data.
arXiv Detail & Related papers (2025-03-01T08:29:59Z) - Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning [29.47233232259932]
We propose a tuning-free Attention-driven Grounding (TAG) method that leverages inherent attention patterns in pretrained MLLMs to accomplish this task without additional fine-tuning. Our method achieves performance comparable to tuning-based methods, with notable success in text localization. We demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5.
arXiv Detail & Related papers (2024-12-14T14:30:05Z) - Improved GUI Grounding via Iterative Narrowing [0.03375622857152329]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
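An iterative narrowing loop of this kind can be sketched as follows: each round queries the model on the current crop, maps the local answer back to absolute screen coordinates, and then shrinks the crop around that prediction. The `locate` callable here is purely hypothetical, standing in for a VLM query; the step count and shrink factor are our own choices, not the paper's.

```python
# Hedged sketch of iterative narrowing for GUI grounding.
# `locate(x0, y0, w, h)` stands in for a VLM call and must return a point
# (lx, ly) relative to the crop whose top-left is (x0, y0).
def narrow(locate, size, steps=3, shrink=0.5):
    w, h = size
    x0, y0 = 0, 0                          # top-left of the current crop
    for _ in range(steps):
        lx, ly = locate(x0, y0, w, h)      # prediction inside the crop
        cx, cy = x0 + lx, y0 + ly          # map back to absolute coordinates
        w, h = int(w * shrink), int(h * shrink)
        x0, y0 = cx - w // 2, cy - h // 2  # recenter a smaller crop on it
    return cx, cy
```

With a perfect locator the loop is a no-op; its value in practice is that each zoomed-in crop gives the model a larger, less cluttered view of the target, so errors shrink across rounds.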
arXiv Detail & Related papers (2024-11-18T05:47:12Z) - Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves? [61.899791071654654]
We investigate whether Vision-Language Models (VLMs) can improve their semantic grounding by "receiving" feedback. We find that, if prompted appropriately, VLMs can utilize feedback both in a single step and iteratively. We show that grounding accuracy consistently improves using automated feedback across all models in all settings investigated.
arXiv Detail & Related papers (2024-04-09T17:59:04Z)