Related papers: GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

URL: http://arxiv.org/abs/2406.10819v1
Date: Sun, 16 Jun 2024 06:56:53 GMT
Title: GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
Authors: Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun,
Abstract summary: This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
Score: 73.9254861755974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

Related papers

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.06743063375121]
We propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. We produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks.
arXiv Detail & Related papers (2025-04-17T06:15:56Z)
TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents [0.6827423171182154]
TRISHUL is a training-free framework that enhances generalist LVLMs for holistic GUI comprehension. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. For GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
arXiv Detail & Related papers (2025-02-12T09:12:30Z)
GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration [56.58744345634623]
We propose GUI-Bee, an MLLM-based autonomous agent, to collect high-quality, environment-specific data through exploration. We also introduce NovelScreenSpot, a benchmark for testing how well the data can help align GUI action grounding models to novel environments.
arXiv Detail & Related papers (2025-01-23T18:16:21Z)
Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments. Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms. We develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input. Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks. We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z)
MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents. MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders. It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
arXiv Detail & Related papers (2024-07-05T08:37:10Z)
GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning. We introduce our GUI action dataset textbfAct2Cap as well as a simple yet effective framework, textbfGUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z)
GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software. Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP) With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
arXiv Detail & Related papers (2024-01-17T08:10:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.