VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- URL: http://arxiv.org/abs/2406.10227v1
- Date: Fri, 14 Jun 2024 17:59:08 GMT
- Title: VideoGUI: A Benchmark for GUI Automation from Instructional Videos
- Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou,
- Abstract summary: VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
- Score: 78.97292966276706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.
Related papers
- GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples.
This task presents unique challenges compared to natural scene video captioning.
We introduce our GUI action dataset textbfAct2Cap as well as a simple yet effective framework, textbfGUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces [27.84098739594353]
Graph4GUI exploits graph neural networks to capture individual elements' properties and semantic-visuo-spatial constraints in a layout.
The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task.
arXiv Detail & Related papers (2024-04-21T04:06:09Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP)
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z) - SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
arXiv Detail & Related papers (2024-01-17T08:10:35Z) - ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation [30.693616802332745]
This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks.
We propose an advanced Actor-Critic framework, which incorporates a sophisticated GUI driven by an AI agent and adept at handling lengthy procedural tasks.
arXiv Detail & Related papers (2023-12-20T15:28:38Z) - Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z) - A Unified Sequence Interface for Vision Tasks [87.328893553186]
We show that a diverse set of "core" computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.
We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs.
We show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization.
arXiv Detail & Related papers (2022-06-15T17:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.