Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
- URL: http://arxiv.org/abs/2508.05615v1
- Date: Thu, 07 Aug 2025 17:54:27 GMT
- Title: Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
- Authors: Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, Yongliang Shen
- Abstract summary: We propose GUI-RC, a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions. We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
- Score: 17.954613936413942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
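As a concrete illustration of the two methods described in the abstract, here is a minimal sketch of region-consistency voting (GUI-RC) and the consistency-based reward (GUI-RCPO). The grid resolution, the voting scheme, and the overlap-based reward form are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def region_consistency(preds, width, height, cell=10):
    """Aggregate K sampled box predictions (x1, y1, x2, y2) into a spatial
    voting grid; return the consensus point and per-prediction rewards.
    A minimal sketch of GUI-RC / GUI-RCPO under assumed details."""
    gw, gh = width // cell, height // cell
    grid = np.zeros((gh, gw), dtype=np.int32)

    # Each sampled prediction casts a vote on every grid cell it covers.
    for x1, y1, x2, y2 in preds:
        c1, r1 = int(x1) // cell, int(y1) // cell
        c2, r2 = int(x2) // cell, int(y2) // cell
        grid[r1:r2 + 1, c1:c2 + 1] += 1

    # Consensus region: the cells where the most predictions agree.
    consensus = grid == grid.max()

    # GUI-RC: ground the instruction at the center of the consensus mass.
    rows, cols = np.nonzero(consensus)
    point = ((cols.mean() + 0.5) * cell, (rows.mean() + 0.5) * cell)

    # GUI-RCPO-style reward (assumed form): the fraction of each
    # prediction's cells that fall inside the consensus region, so
    # samples aligned with the collective agreement score higher.
    rewards = []
    for x1, y1, x2, y2 in preds:
        c1, r1 = int(x1) // cell, int(y1) // cell
        c2, r2 = int(x2) // cell, int(y2) // cell
        patch = consensus[r1:r2 + 1, c1:c2 + 1]
        rewards.append(float(patch.mean()) if patch.size else 0.0)
    return point, rewards

# Toy usage: three of four sampled boxes overlap around (105, 62).
preds = [(80, 40, 120, 80), (85, 45, 125, 85), (90, 40, 130, 80),
         (300, 200, 340, 240)]
point, rewards = region_consistency(preds, width=1280, height=720)
```

Read this way, the reward needs no labels: a sample scores high exactly when it agrees with where the other samples cluster, which is what lets GUI-RCPO optimize on unlabeled screens at inference time.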
Related papers
- GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking [55.762798168494726]
GUI-ReRank is a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques. We evaluated our approach on an established NL-based GUI retrieval benchmark.
arXiv Detail & Related papers (2025-08-05T10:17:38Z)
- GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding [51.497245303008015]
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Motivated by human clicking behavior, which naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards. We show that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro.
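The Gaussian reward idea lends itself to a compact sketch: a dense reward that peaks at the element center and decays smoothly with distance, with the spread tied to element size. The parameterization below is an illustrative assumption, not the paper's exact formulation.

```python
import math

def gaussian_point_reward(pred_x, pred_y, cx, cy, w, h):
    """Dense reward peaking at the element center (cx, cy) and decaying
    with distance. Scaling sigma with element size (w, h) is an assumed
    choice for illustration."""
    sx, sy = max(w / 2.0, 1.0), max(h / 2.0, 1.0)
    return math.exp(-0.5 * (((pred_x - cx) / sx) ** 2
                            + ((pred_y - cy) / sy) ** 2))
```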
arXiv Detail & Related papers (2025-07-21T17:53:42Z)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z)
- DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [52.37530640460363]
We introduce DiMo-GUI, a training-free framework for GUI grounding. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
arXiv Detail & Related papers (2025-06-12T03:13:21Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
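Schematically, such an attention head can be read as scoring every visual patch against the dedicated <ACTOR> token and grounding at the highest-attention patch. The sketch below is a hypothetical rendering of that idea, not GUI-Actor's actual action head.

```python
import numpy as np

def actor_attention(actor_tok, patch_toks):
    """Score each visual patch embedding against the <ACTOR> token and
    return a distribution over patches; the argmax patch (or an
    attention-weighted average of patch centers) gives the target.
    A schematic sketch under assumed shapes: actor_tok (d,),
    patch_toks (num_patches, d)."""
    scores = patch_toks @ actor_tok / np.sqrt(patch_toks.shape[-1])
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()  # softmax attention over patches
```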
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents [13.415165482033395]
Out-of-distribution (OOD) instructions that violate environmental constraints or exceed a GUI agent's current capabilities can cause task breakdowns or pose security threats. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. We propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent, which reflect its capability boundary.
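A rough sketch of the Gaussian-mixture idea follows, assuming distances to the centroid of in-distribution embeddings and a percentile likelihood threshold; both choices are illustrative assumptions not specified in the summary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ood_detector(train_emb, n_components=3):
    """Fit a Gaussian mixture over distances between training embeddings
    and their centroid; the 1st-percentile threshold is an assumption."""
    center = train_emb.mean(axis=0)
    dists = np.linalg.norm(train_emb - center, axis=1, keepdims=True)
    gmm = GaussianMixture(n_components=n_components).fit(dists)
    threshold = np.percentile(gmm.score_samples(dists), 1)
    return center, gmm, threshold

def is_ood(emb, center, gmm, threshold):
    """Flag an input whose distance has low likelihood under the mixture."""
    d = np.array([[np.linalg.norm(emb - center)]])
    return gmm.score_samples(d)[0] < threshold  # low likelihood => OOD
```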
arXiv Detail & Related papers (2025-05-19T08:29:05Z)
- GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration [56.58744345634623]
We propose GUI-Bee, an MLLM-based autonomous agent that collects high-quality, environment-specific data through exploration. We also introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments.
arXiv Detail & Related papers (2025-01-23T18:16:21Z)
- Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism. In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation. Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z)
- Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
arXiv Detail & Related papers (2024-11-18T05:47:12Z)
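The narrowing mechanism can be pictured as a crop-and-requery loop: each round crops toward the model's current guess, so the next query grounds the instruction on a higher-resolution view. The sketch below assumes a hypothetical `predict(image, instruction) -> (x, y)` callable, a fixed shrink factor, and PIL-style images; these are illustrative choices, not the paper's settings.

```python
def iterative_narrowing(screenshot, instruction, predict, steps=3, shrink=0.5):
    """Repeatedly crop toward the model's current prediction and re-query.
    `predict` is a hypothetical grounding callable; `screenshot` is a
    PIL-style image with .size and .crop."""
    ox, oy = 0, 0                      # crop origin in global coordinates
    img = screenshot
    for _ in range(steps):
        x, y = predict(img, instruction)
        w, h = img.size                # (width, height)
        cw, ch = int(w * shrink), int(h * shrink)
        left = min(max(int(x - cw / 2), 0), w - cw)
        top = min(max(int(y - ch / 2), 0), h - ch)
        img = img.crop((left, top, left + cw, top + ch))
        ox, oy = ox + left, oy + top
    x, y = predict(img, instruction)
    return ox + x, oy + y              # final point in global coordinates
```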