ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
- URL: http://arxiv.org/abs/2505.15259v2
- Date: Sat, 24 May 2025 15:08:35 GMT
- Title: ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search
- Authors: Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo,
- Abstract summary: ReGUIDE is a framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism.<n>Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
- Score: 53.40810298627443
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is often required for fine-grained actions. However, this remains significantly challenging, leading prior works to rely on large-scale web datasets to improve the grounding accuracy. In this work, we propose Reasoning Graphical User Interface Grounding for Data Efficiency (ReGUIDE), a novel and effective framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism. More specifically, ReGUIDE learns to (i) self-generate a language reasoning process for the localization via online reinforcement learning, and (ii) criticize the prediction using spatial priors that enforce equivariance under input transformations. At inference time, ReGUIDE further boosts performance through a test-time scaling strategy, which combines spatial search with coordinate aggregation. Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks, outperforming baselines with substantially fewer training data points (e.g., only 0.2% samples compared to the best open-sourced baselines).
Related papers
- LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization [58.65395773049273]
Location Preference Optimization (LPO) is a novel approach that leverages locational data to optimize interaction preferences.<n>LPO uses information entropy to predict interaction positions by focusing on zones rich in information.<n>Our code will be made publicly available soon, at https://github.com/AIDC-AI/LPO.
arXiv Detail & Related papers (2025-06-11T03:43:30Z) - What Really Matters for Learning-based LiDAR-Camera Calibration [50.2608502974106]
This paper revisits the development of learning-based LiDAR-Camera calibration.<n>We identify the critical limitations of regression-based methods with the widely used data generation pipeline.<n>We also investigate how the input data format and preprocessing operations impact network performance.
arXiv Detail & Related papers (2025-01-28T14:12:32Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)<n>Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.<n>These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit? [15.550663626482903]
We investigate the efficacy of data augmentations to close the domain gap in spaceborne computer vision.
We propose two novel data augmentations specifically developed to emulate the visual effects observed in orbital imagery.
arXiv Detail & Related papers (2024-10-21T08:24:46Z) - SPADES: A Realistic Spacecraft Pose Estimation Dataset using Event
Sensing [9.583223655096077]
Due to limited access to real target datasets, algorithms are often trained using synthetic data and applied in the real domain.
Event sensing has been explored in the past and shown to reduce the domain gap between simulations and real-world scenarios.
We introduce a novel dataset, SPADES, comprising real event data acquired in a controlled laboratory environment and simulated event data using the same camera intrinsics.
arXiv Detail & Related papers (2023-11-09T12:14:47Z) - Latent Task-Specific Graph Network Simulators [16.881339139068018]
Graph Network Simulators (GNSs) pose an efficient alternative to traditional physics-based simulators.
We frame mesh-based simulation as a meta-learning problem and use a recent Bayesian meta-learning method to improve GNSs adaptability to new scenarios.
We validate the effectiveness of our approach through various experiments, performing on par with or better than established baseline methods.
arXiv Detail & Related papers (2023-11-09T10:30:51Z) - Simple and Effective Augmentation Methods for CSI Based Indoor
Localization [37.3026733673066]
We propose two algorithms for channel state information based indoor localization motivated by physical considerations.
As little as 10% of the original dataset size is enough to get the same performance as the original dataset.
If we further augment the dataset with the proposed techniques, test accuracy is improved more than three-fold.
arXiv Detail & Related papers (2022-11-19T20:27:46Z) - RAIS: Robust and Accurate Interactive Segmentation via Continual
Learning [16.382862088005087]
We propose RAIS, a robust and accurate architecture for interactive segmentation with continuous learning.
For efficient learning on the test set, we propose a novel optimization strategy to update global and local parameters.
Our method also shows its robustness in the datasets of remote sensing and medical imaging.
arXiv Detail & Related papers (2022-10-20T03:05:44Z) - CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.