Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
- URL: http://arxiv.org/abs/2505.13227v3
- Date: Fri, 24 Oct 2025 19:08:03 GMT
- Title: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
- Authors: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong,
- Abstract summary: Graphical user interface (GUI) grounding remains a critical bottleneck in computer use agent development.<n>We introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types.<n>We synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples.
- Score: 57.371814877372515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.
Related papers
- Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents.<n>In this paper, we investigate zoom as a strong yet underexplored prior to GUI grounding, and propose a training-free method, ZoomClick.<n> Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
arXiv Detail & Related papers (2025-12-05T18:39:12Z) - SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios [7.983317067810301]
SWITCH (Semantic World Interface Tasks for Control and Handling) is an embodied, task-driven benchmark created through iterative releases to probe these gaps.<n>It evaluates five complementary abilities:task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification.<n>Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMMs exhibit inconsistent performance even on single-step interactions.
arXiv Detail & Related papers (2025-11-20T09:52:20Z) - MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements [7.2364254826655925]
MEGA-GUI is a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding.<n> MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity.<n>On the visually dense ScreenSpot-Pro benchmark, MEGA-GUI attains 73.18% accuracy, and on the semantically complex OSWorld-G benchmark it reaches 68.63%, surpassing previously reported results.
arXiv Detail & Related papers (2025-11-17T07:38:05Z) - Grounding Computer Use Agents on Human Demonstrations [40.66362945241247]
GroundCUA is a large-scale desktop grounding dataset built from expert human demonstrations.<n>It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations.<n>Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements.
arXiv Detail & Related papers (2025-11-10T17:35:21Z) - Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance.<n>This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization.<n> Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0%$ and $3.7%$ accuracy respectively when used independently, their integration within GMS framework yields an overall accuracy of $35.7%$.
arXiv Detail & Related papers (2025-09-29T00:06:31Z) - R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms.<n>Existing vision-only GUI agents directly ground elements from large and cluttered screenshots.<n>We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z) - ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search [53.40810298627443]
ReGUIDE is a framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism.<n>Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-21T08:36:18Z) - UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis [15.429065788185522]
We introduce a large-scale data synthesis pipeline UI-E2I- Synth for generating varying complex instruction datasets.<n>We propose a new GUI instruction grounding benchmark UI-I2E-Bench, which is designed to address the limitations of existing benchmarks.<n>Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding.
arXiv Detail & Related papers (2025-04-15T14:56:21Z) - UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction [16.731754927372585]
We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents.<n>Unlike online benchmarks, UI-Vision provides dense, high-quality annotations of human demonstrations.<n>Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B.
arXiv Detail & Related papers (2025-03-19T19:26:17Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.<n>In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)<n>Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.<n>These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - OmniParser for Pure Vision Based GUI Agent [37.911094082816504]
Power multimodal models like GPT-4V as a general agent on multiple operating systems are largely underestimated due to the lack of a robust screen parsing technique.
textsc Omni significantly improves GPT-4V's performance on ScreenSpot benchmark.
textsc Omni with screenshot only outperforms GPT-4V baselines requiring additional information outside of screenshot.
arXiv Detail & Related papers (2024-08-01T00:00:43Z) - Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [1.3107174618549584]
Instruction Visual Grounding (IVG) is a multi-modal approach to object identification within a Graphical User Interface (GUI)<n>We propose IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and IVGdirect, which uses a multimodal architecture for end-to-end grounding.<n>Our final test dataset is publicly released to support future research.
arXiv Detail & Related papers (2024-05-05T19:10:19Z) - OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z) - ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills [24.150758623016195]
We present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark for generalizable manipulation skills.
ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames.
It defines a unified interface and evaluation protocol to support a wide range of algorithms.
It empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS.
arXiv Detail & Related papers (2023-02-09T14:24:01Z) - First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual
Information Maximization [112.40598205054994]
We formalize this idea as a completely unsupervised objective for optimizing interfaces.
We conduct an observational study on 540K examples of users operating various keyboard and eye gaze interfaces for typing, controlling simulated robots, and playing video games.
The results show that our mutual information scores are predictive of the ground-truth task completion metrics in a variety of domains.
arXiv Detail & Related papers (2022-05-24T21:57:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.