VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
- URL: http://arxiv.org/abs/2512.16501v1
- Date: Thu, 18 Dec 2025 13:09:09 GMT
- Title: VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
- Authors: Beitong Zhou, Zhexiao Huang, Yuan Guo, Zhangxuan Gu, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen
- Abstract summary: We present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding. VenusBench-GD spans multiple platforms, enabling hierarchical evaluation for real-world applications.
- Score: 20.241892335459497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
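As background for how benchmarks like this are typically scored: the standard GUI element-grounding metric counts a prediction as correct when the model's predicted click point falls inside the target element's ground-truth bounding box. The abstract does not specify VenusBench-GD's exact protocol, so the sketch below illustrates only this common convention; all names are illustrative.

```python
def click_accuracy(predictions, boxes):
    """Fraction of predicted click points landing inside their
    ground-truth element bounding boxes.

    predictions: list of (x, y) click points
    boxes:       list of (x1, y1, x2, y2) boxes, same order
    """
    hits = 0
    for (x, y), (x1, y1, x2, y2) in zip(predictions, boxes):
        # A hit requires the point to lie within the box on both axes.
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / len(boxes)


preds = [(50, 40), (200, 10)]
gt = [(30, 20, 80, 60), (0, 0, 100, 100)]
print(click_accuracy(preds, gt))  # 0.5: first click hits, second misses
```

Hierarchical taxonomies such as the one proposed here would report this accuracy separately per subtask, which is what exposes the basic-vs-advanced gap the authors describe.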
Related papers
- GEBench: Benchmarking Image Generation Models as GUI Environments [49.513441724802135]
We introduce GEBench, a benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GE-Score is a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks.
arXiv Detail & Related papers (2026-02-09T18:52:02Z) - Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding [71.97466930670936]
Grounding is a fundamental capability for building graphical user interface (GUI) agents. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models.
arXiv Detail & Related papers (2025-12-05T18:39:12Z) - Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding [53.14935624161711]
GMS: Generalist Scanner Meets Specialist Locator is a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only 2.0% and 3.7% accuracy respectively when used independently, their integration within the GMS framework yields an overall accuracy of 35.7%.
arXiv Detail & Related papers (2025-09-29T00:06:31Z) - RemoteSAM: Towards Segment Anything for Earth Observation [29.707796048411705]
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets. We present RemoteSAM, a foundation model that establishes a new SoTA on several Earth observation perception benchmarks.
arXiv Detail & Related papers (2025-05-23T15:27:57Z) - Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis [57.371814877372515]
Graphical user interface (GUI) grounding remains a critical bottleneck in computer use agent development. We introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types. We synthesize and release the largest computer use grounding dataset, Jedi, which contains 4 million examples.
arXiv Detail & Related papers (2025-05-19T15:09:23Z) - UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis [15.429065788185522]
Vision-based Graphical User Interface (GUI) agents utilize human-like vision perception capabilities to enhance productivity on digital devices. In this paper, we investigate underexplored challenges in this task, including element-to-screen ratio, unbalanced element types, and implicit instructions. We propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks.
arXiv Detail & Related papers (2025-04-15T14:56:21Z) - ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use [47.568491119335924]
We introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. We propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art performance of 48.1% without any additional training.
arXiv Detail & Related papers (2025-04-04T14:25:17Z) - Improved GUI Grounding via Iterative Narrowing [0.03375622857152329]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
arXiv Detail & Related papers (2024-11-18T05:47:12Z) - BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.