ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
- URL: http://arxiv.org/abs/2504.07981v1
- Date: Fri, 04 Apr 2025 14:25:17 GMT
- Title: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
- Authors: Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua
- Abstract summary: We introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. We propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art accuracy of 48.1% without any additional training.
- Score: 47.568491119335924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multi-modal Large Language Models (MLLMs) have led to significant progress in developing GUI agents for general tasks such as web browsing and mobile phone use. However, their application in professional domains remains under-explored. These specialized workflows introduce unique challenges for GUI perception models, including high-resolution displays, smaller target sizes, and complex environments. In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. The benchmark comprises authentic high-resolution images from a variety of professional domains with expert annotations. It spans 23 applications across five industries and three operating systems. Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9% accuracy. Our experiments reveal that strategically reducing the search area enhances accuracy. Based on this insight, we propose ScreenSeekeR, a visual search method that utilizes the GUI knowledge of a strong planner to guide a cascaded search, achieving state-of-the-art accuracy of 48.1% without any additional training. We hope that our benchmark and findings will advance the development of GUI agents for professional applications. Code, data, and leaderboard can be found at https://gui-agent.github.io/grounding-leaderboard.
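The abstract's key mechanism, a planner-guided cascaded search that shrinks the search area before grounding, can be illustrated with a short sketch. The following is a minimal, hypothetical Python sketch only: `plan_subregion` and `ground_point` are assumed stand-ins for MLLM planner and grounder queries, not ScreenSeekeR's actual API, and the depth and crop-size heuristics are illustrative.

```python
# Hypothetical sketch of planner-guided coarse-to-fine GUI grounding,
# in the spirit of ScreenSeekeR as described in the abstract above.
# plan_subregion / ground_point are assumed stand-ins for MLLM calls.
from PIL import Image


def plan_subregion(image: Image.Image, instruction: str):
    """Assumed planner query: return a (left, top, right, bottom) box for
    the sub-region most likely to contain the target, or None to stop."""
    raise NotImplementedError  # stand-in for a strong MLLM planner


def ground_point(image: Image.Image, instruction: str) -> tuple[int, int]:
    """Assumed grounder query: return the (x, y) click point inside image."""
    raise NotImplementedError  # stand-in for a GUI grounding model


def cascaded_ground(screenshot: Image.Image, instruction: str,
                    max_depth: int = 3, min_side: int = 1000) -> tuple[int, int]:
    """Coarse-to-fine search: let the planner repeatedly narrow the crop,
    ground in the final crop, then map coordinates back to the full screen.
    Narrowing matters because small targets on high-resolution professional
    UIs occupy too few pixels for a grounder to localize directly."""
    off_x, off_y = 0, 0
    crop = screenshot
    for _ in range(max_depth):
        if min(crop.size) <= min_side:  # crop already small enough to ground
            break
        box = plan_subregion(crop, instruction)
        if box is None:
            break
        left, top, _, _ = box
        crop = crop.crop(box)                      # zoom into the planner's region
        off_x, off_y = off_x + left, off_y + top   # track global offset
    x, y = ground_point(crop, instruction)
    return off_x + x, off_y + y                    # point in full-screenshot coords
```

Whether the grounder is queried at every level or only on the final crop is a design choice; this version grounds once at the end to keep the sketch minimal.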
Related papers
- TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.06743063375121]
We propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials.
We produce the GUI-Net dataset containing 143K trajectories across five operating systems and more than 200 applications.
We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks.
arXiv Detail & Related papers (2025-04-17T06:15:56Z)
- GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents [16.72683291432717]
GUI-R1 is the first reinforcement learning framework designed to enhance the capabilities of LVLMs in high-level real-world task scenarios.
GUI-R1 achieves superior performance using only 0.02% of the data compared to previous state-of-the-art methods like OS-Atlas.
arXiv Detail & Related papers (2025-04-14T17:45:54Z)
- GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration [22.814882629516635]
We introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization.
To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning.
arXiv Detail & Related papers (2025-03-22T09:30:37Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism. In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation. Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations and grounds natural-language instructions to visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
arXiv Detail & Related papers (2024-11-18T05:47:12Z)
- GUI Agents with Foundation Models: A Comprehensive Survey [91.97447457550703]
This survey consolidates recent research on (M)LLM-based GUI agents. We identify key challenges and propose future research directions. We hope this survey will inspire further advancements in the field of (M)LLM-based GUI agents.
arXiv Detail & Related papers (2024-11-07T17:28:10Z)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and out-of-distribution (OOD) agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z)
- ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [30.693616802332745]
This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks.
We propose an advanced Actor-Critic framework that incorporates a sophisticated GUI parser driven by an LLM agent and is adept at handling lengthy procedural tasks.
arXiv Detail & Related papers (2023-12-20T15:28:38Z)