R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
- URL: http://arxiv.org/abs/2507.05673v1
- Date: Tue, 08 Jul 2025 04:56:57 GMT
- Title: R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
- Authors: Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar,
- Abstract summary: A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
- Score: 18.100091500983044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
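To make the IoU-aware objective mentioned above concrete, here is a minimal sketch assuming boxes in (x1, y1, x2, y2) format and an arbitrary L1 weighting; it illustrates the general idea of rewarding high-IoU predictions, not the paper's exact loss:

```python
# Illustrative sketch only: element-wise IoU and an IoU-aware regression term.
# The box format and loss weighting are assumptions, not the paper's formulation.
import torch

def box_iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Element-wise IoU for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]).clamp(min=0) * (target[:, 3] - target[:, 1]).clamp(min=0)
    return inter / (area_p + area_t - inter + 1e-7)

def iou_aware_loss(pred: torch.Tensor, target: torch.Tensor, l1_weight: float = 1.0) -> torch.Tensor:
    """Push predictions toward high IoU: a (1 - IoU) term plus a small L1 term for stable gradients."""
    iou = box_iou(pred, target)
    l1 = (pred - target).abs().mean(dim=1)
    return ((1.0 - iou) + l1_weight * l1).mean()
```

In practice a term like this would be combined with the token-level cross-entropy objective the abstract refers to; the weighting here is purely illustrative.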
Related papers
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [17.954613936413942]
We propose a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions. We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
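A minimal sketch of the spatial-voting idea, assuming the agent is sampled several times and each sample yields a pixel-space bounding box; the grid resolution and consensus rule are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def consensus_point(sampled_boxes, width, height, grid=64):
    """Accumulate votes from sampled boxes on a coarse grid and return the densest cell's center."""
    votes = np.zeros((grid, grid), dtype=np.int32)
    for x1, y1, x2, y2 in sampled_boxes:
        gx1 = int(np.clip(x1 / width * grid, 0, grid - 1))
        gy1 = int(np.clip(y1 / height * grid, 0, grid - 1))
        gx2 = int(np.clip(np.ceil(x2 / width * grid), gx1 + 1, grid))
        gy2 = int(np.clip(np.ceil(y2 / height * grid), gy1 + 1, grid))
        votes[gy1:gy2, gx1:gx2] += 1  # each sampled box votes for the cells it covers
    gy, gx = np.unravel_index(votes.argmax(), votes.shape)
    # Map the winning cell back to pixel coordinates (cell center).
    return (gx + 0.5) * width / grid, (gy + 0.5) * height / grid
```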
arXiv Detail & Related papers (2025-08-07T17:54:27Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
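As a rough illustration of the coordinate-free idea, a head of this kind can score a dedicated token's hidden state against every visual patch state and read the grounded location off the attention weights. The dimensions and scoring rule below are assumptions, not GUI-Actor's exact design:

```python
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    """Toy attention head: align an actor-token state with visual patch states."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the dedicated actor-token state
        self.k = nn.Linear(dim, dim)  # projects each visual patch state

    def forward(self, actor_state: torch.Tensor, patch_states: torch.Tensor) -> torch.Tensor:
        # actor_state: (batch, dim); patch_states: (batch, num_patches, dim)
        scores = torch.einsum("bd,bnd->bn", self.q(actor_state), self.k(patch_states))
        # Attention over patches; the argmax patch indicates the target location.
        return scores.softmax(dim=-1)
```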
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z)
- ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search [53.40810298627443]
ReGUIDE is a framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-21T08:36:18Z)
- GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents [13.415165482033395]
Out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of GUI agents can cause task breakdowns or pose security threats. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. We propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent that reflect its capability boundary.
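A rough sketch of this style of detector (not GEM's exact procedure): fit a Gaussian mixture over the distances between training embeddings and their centroid, then flag inputs whose distance is unlikely under the fitted mixture. The centroid-distance feature and percentile threshold are assumptions made for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ood_detector(train_embeddings: np.ndarray, n_components: int = 3):
    """Fit a GMM over embedding-to-centroid distances from in-distribution data."""
    centroid = train_embeddings.mean(axis=0)
    dists = np.linalg.norm(train_embeddings - centroid, axis=1).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(dists)
    threshold = np.percentile(gmm.score_samples(dists), 1)  # 1st-percentile log-likelihood
    return centroid, gmm, threshold

def is_ood(embedding: np.ndarray, centroid, gmm, threshold) -> bool:
    """Flag an input whose centroid distance has low likelihood under the fitted GMM."""
    dist = np.array([[np.linalg.norm(embedding - centroid)]])
    return bool(gmm.score_samples(dist)[0] < threshold)
```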
arXiv Detail & Related papers (2025-05-19T08:29:05Z)
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of 28+% on Screenspot-pro and 24+% on WebVoyager benchmarks.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents [16.72683291432717]
GUI-R1 is the first reinforcement learning framework designed to enhance the capabilities of LVLMs in high-level real-world task scenarios. GUI-R1 achieves superior performance using only 0.02% of the data compared to previous state-of-the-art methods like OS-Atlas.
arXiv Detail & Related papers (2025-04-14T17:45:54Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
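The narrowing loop can be pictured roughly as follows; `predict_point` stands in for the grounding model and is a hypothetical placeholder, and the halving schedule and step count are illustrative choices rather than the paper's settings:

```python
from PIL import Image

def iterative_narrowing(image: Image.Image, instruction: str, predict_point, steps: int = 3):
    """Repeatedly crop around the current prediction and re-query on the zoomed view."""
    left = top = 0
    crop = image
    x = y = 0
    for _ in range(steps):
        px, py = predict_point(crop, instruction)  # point in crop coordinates
        x, y = left + px, top + py                 # map back to full-image coordinates
        w = max(1, crop.width // 2)                # halve the field of view each step
        h = max(1, crop.height // 2)
        left = max(0, min(int(x - w // 2), image.width - w))
        top = max(0, min(int(y - h // 2), image.height - h))
        crop = image.crop((left, top, left + w, top + h))
    return x, y
```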
arXiv Detail & Related papers (2024-11-18T05:47:12Z)
- Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [1.3107174618549584]
Instruction Visual Grounding (IVG) is a multi-modal approach to object identification within a Graphical User Interface (GUI). We propose IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and IVGdirect, which uses a multimodal architecture for end-to-end grounding. Our final test dataset is publicly released to support future research.
arXiv Detail & Related papers (2024-05-05T19:10:19Z)