SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- URL: http://arxiv.org/abs/2401.10935v2
- Date: Fri, 23 Feb 2024 04:36:51 GMT
- Title: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- Authors: Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing
Zhang, Zhiyong Wu
- Abstract summary: We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical User Interface (GUI) agents are designed to automate complex tasks
on digital devices, such as smartphones and desktops. Most existing GUI agents
interact with the environment through extracted structured data, which can be
notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops).
To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which
only relies on screenshots for task automation. In our preliminary study, we
have discovered a key challenge in developing visual GUI agents: GUI grounding
-- the capacity to accurately locate screen elements based on instructions. To
tackle this challenge, we propose to enhance SeeClick with GUI grounding
pre-training and devise a method to automate the curation of GUI grounding
data. Along with the efforts above, we have also created ScreenSpot, the first
realistic GUI grounding benchmark that encompasses mobile, desktop, and web
environments. After pre-training, SeeClick demonstrates significant improvement
in ScreenSpot over various baselines. Moreover, comprehensive evaluations on
three widely used benchmarks consistently support our finding that advancements
in GUI grounding directly correlate with enhanced performance in downstream GUI
agent tasks. The model, data and code are available at
https://github.com/njucckevin/SeeClick.
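GUI grounding, as described in the abstract, means mapping a natural-language instruction to the location of a screen element. A common way to score this (a minimal sketch, not the paper's published evaluation code; the function names and the exact metric are assumptions based on standard click-accuracy benchmarks) is to count a prediction as correct when the predicted click point lands inside the target element's bounding box:

```python
# Sketch of click-accuracy scoring for GUI grounding.
# Assumption: coordinates are normalized to [0, 1] and a prediction is
# correct when the click point falls inside the target bounding box.

def click_in_bbox(point, bbox):
    """point: (x, y); bbox: (left, top, right, bottom)."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(predictions, targets):
    """Fraction of predicted click points inside their target boxes."""
    if not targets:
        return 0.0
    hits = sum(click_in_bbox(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)

preds = [(0.52, 0.31), (0.10, 0.90)]
boxes = [(0.45, 0.25, 0.60, 0.40), (0.70, 0.70, 0.95, 0.95)]
print(grounding_accuracy(preds, boxes))  # first point hits, second misses -> 0.5
```

The point-in-box criterion is what makes screenshot-only agents practical: the model never needs element IDs from an accessibility tree or HTML, only pixel coordinates.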
Related papers
- ScaleTrack: Scaling and back-tracking Automated GUI Agents [11.046190201201348]
We propose ScaleTrack, a training framework by scaling grounding and backtracking planning for automated GUI agents.
We collect GUI samples produced under different synthesis criteria from a wide range of sources and unify them into the same template for training GUI grounding models.
We design a novel training strategy that predicts the next action from the current GUI image, while also backtracking the historical actions that led to the GUI image.
arXiv Detail & Related papers (2025-05-01T09:27:13Z)
- TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.06743063375121]
We propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials.
We produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications.
We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks.
arXiv Detail & Related papers (2025-04-17T06:15:56Z)
- WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation [20.11855701656702]
We present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions.
We also propose GUI-Thinker, a holistic framework, that effectively manages the unpredictability and complexity of GUI interactions.
arXiv Detail & Related papers (2025-02-12T01:06:10Z)
- WinClick: GUI Grounding with Multimodal Large Language Models [46.44235543835595]
We introduce WinClick, a novel visual GUI agent developed for the Windows platform.
To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training.
We also introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows.
arXiv Detail & Related papers (2025-01-27T08:29:17Z)
- GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration [56.58744345634623]
We propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration.
We also introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments.
arXiv Detail & Related papers (2025-01-23T18:16:21Z)
- Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism.
In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation.
Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z)
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input.
Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z) - Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI.
We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots.
We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z) - Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding [30.624179161014283]
We propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the ScreenPR task.
Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree.
Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements.
arXiv Detail & Related papers (2024-06-27T15:34:16Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
- From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces [66.85108822706489]
This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use.
It is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
arXiv Detail & Related papers (2023-05-31T23:39:18Z)
- GUILGET: GUI Layout GEneration with Transformer [26.457270239234383]
The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts.
GUILGET is based on transformers in order to capture the semantics of relationships between elements in the GUI-AG.
Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG.
arXiv Detail & Related papers (2023-04-18T14:27:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.