AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- URL: http://arxiv.org/abs/2407.17490v1
- Date: Wed, 3 Jul 2024 17:59:58 GMT
- Title: AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
- Authors: Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, Hongsheng Li
- Abstract summary: We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
- Score: 50.39555842254652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents have drawn increasing attention, largely for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents. The dataset is used to train and evaluate agents' capability to complete complex tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, annotated at multiple levels. Unlike existing mobile device-control datasets, e.g., MoTIF and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions, each averaging 13 steps with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we develop a baseline model, SPHINX Agent, and compare its performance with state-of-the-art agents trained on other datasets. To facilitate further research, we open-source our dataset, models, and relevant evaluation tools. The project is available at https://yuxiangchai.github.io/AMEX/
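To make the three annotation levels concrete, below is a minimal, hypothetical sketch of what a single AMEX-style record could contain. All field names and values are invented for illustration; the actual schema is defined by the dataset released on the project page. Only the three annotation levels and the stepwise action chain come from the abstract.

```python
# Hypothetical sketch of an AMEX-style annotation record.
# Field names are invented for illustration; the real schema is defined
# by the released dataset (see the project page above).

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ElementAnnotation:
    """Levels 1 and 2: interactive element grounding plus functionality text."""
    bbox: Tuple[int, int, int, int]          # (x1, y1, x2, y2) in screenshot pixels
    functionality: str                        # e.g. "opens the search page"

@dataclass
class ScreenAnnotation:
    screenshot_path: str
    screen_description: str                   # Level 2: what this screen is for
    elements: List[ElementAnnotation] = field(default_factory=list)

@dataclass
class InstructionEpisode:
    """Level 3: a complex natural-language instruction with its GUI-action chain."""
    instruction: str
    steps: List[dict] = field(default_factory=list)   # one action per step, ~13 on average

episode = InstructionEpisode(
    instruction="Search for a coffee shop nearby and open its detail page",
    steps=[
        {"action": "tap", "target_bbox": (40, 120, 680, 180)},   # tap the search bar
        {"action": "type", "text": "coffee shop"},
        {"action": "tap", "target_bbox": (40, 260, 1040, 400)},  # open the first result
    ],
)
```

In this sketch, the grounding boxes and functionality strings correspond to the first two annotation levels, while the instruction episode with its stepwise action chain corresponds to the third.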
Related papers
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale.
Specifically, we leverage large language models (LLMs) to infer element functionality by comparing how the UI content changes before and after simulated interactions with specific UI elements (a minimal, hypothetical sketch of this before-and-after idea appears at the end of this page).
We construct an AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations not provided by previous datasets.
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- A3: Android Agent Arena for Mobile GUI Agents [46.73085454978007]
Mobile GUI agents are designed to autonomously perform tasks on mobile devices.
Android Agent Arena (A3) is a novel evaluation platform for assessing performance on real-world, in-the-wild tasks.
A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios.
arXiv Detail & Related papers (2025-01-02T09:03:56Z)
- Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments.
The Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms.
We develop the GUI agent model Falcon-UI, which is first pretrained on the Insight-UI dataset and then fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations and grounds natural-language instructions to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data [15.801018643716437]
This paper aims to enhance the GUI understanding and interacting capabilities of large vision-language models (LVLMs) through a data-driven approach.
We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web.
Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work.
arXiv Detail & Related papers (2024-10-25T10:46:17Z)
- MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents.
MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders.
It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
arXiv Detail & Related papers (2024-07-05T08:37:10Z)
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
- GUIDE: Graphical User Interface Data for Execution [0.0]
GUIDE is a novel dataset tailored for the advancement of Multimodal Large Language Model (MLLM) applications.
Our dataset encompasses diverse data from various websites, including Apollo (62.67%), Gmail (.43%), and Calendar (22.92%).
arXiv Detail & Related papers (2024-04-09T11:59:41Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
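As referenced in the AutoGUI entry above, the following is a minimal, hypothetical sketch of the before-and-after annotation idea: capture the UI state, simulate an interaction on one element, capture the new state, and ask an LLM to describe what the element does. The Device interface, call_llm helper, and prompt wording are all invented for illustration; this is not the authors' released pipeline.

```python
# Hypothetical sketch of before/after functionality annotation, in the spirit of
# the AutoGUI entry above. The Device protocol and call_llm helper are invented
# placeholders, not part of any released codebase.

from typing import Protocol

class Device(Protocol):
    def dump_ui(self) -> str: ...           # e.g. a serialized accessibility tree or DOM
    def tap(self, element_id: str) -> None: ...

def call_llm(prompt: str) -> str:
    """Placeholder for a call to any large language model API."""
    raise NotImplementedError

def annotate_element(device: Device, element_id: str) -> str:
    """Infer an element's functionality from how the UI changes after tapping it."""
    before = device.dump_ui()                # UI content before the simulated interaction
    device.tap(element_id)                   # simulate interacting with the element
    after = device.dump_ui()                 # UI content after the interaction

    prompt = (
        "You are annotating a GUI element.\n"
        f"UI before tapping element {element_id}:\n{before}\n\n"
        f"UI after tapping it:\n{after}\n\n"
        "In one sentence, describe what this element does."
    )
    return call_llm(prompt)                  # e.g. "Opens the account settings page"
```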