GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- URL: http://arxiv.org/abs/2406.08451v1
- Date: Wed, 12 Jun 2024 17:44:26 GMT
- Title: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Authors: Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
- Abstract summary: GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos.
We developed OdysseyAgent, a multimodal cross-app navigation agent by fine-tuning the Qwen-VL model with a history resampling module.
- Score: 61.48043339441149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets comprising simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent, by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent's superior accuracy compared to existing models. For instance, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44% and 55.49% in in-domain accuracy, and by 2.29% and 48.14% in out-of-domain accuracy on average. The dataset and code will be released at https://github.com/OpenGVLab/GUI-Odyssey.
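The abstract does not specify how the history resampling module works. A common way to compress a variable-length history of screenshot features into a fixed token budget is cross-attention against a small set of learned queries; the sketch below illustrates that general idea (all names and shapes are hypothetical, not OdysseyAgent's actual code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def history_resample(history_tokens, queries):
    """Compress a variable-length history of screen features (T, d)
    into a fixed number of tokens (k, d) via cross-attention:
    each learned query attends over all history tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ history_tokens.T / np.sqrt(d), axis=-1)  # (k, T)
    return attn @ history_tokens  # (k, d), independent of T

# Example: 7 history steps compressed to 4 tokens
rng = np.random.default_rng(0)
history = rng.normal(size=(7, 16))   # T=7 past screens, d=16 features
queries = rng.normal(size=(4, 16))   # k=4 learned latent queries
compressed = history_resample(history, queries)
```

The key property is that the output size depends only on the number of queries, so the agent's per-step input stays bounded no matter how long the episode runs.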
Related papers
- MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents.
MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders.
It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
arXiv Detail & Related papers (2024-07-05T08:37:10Z) - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - CogAgent: A Visual Language Model for GUI Agents [61.26491779502794]
We introduce CogAgent, a visual language model (VLM) specializing in GUI understanding and navigation.
By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120.
CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE.
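The dual-encoder design implies the model consumes two views of each screenshot: the full high-resolution image for fine text, plus a downsampled copy for global layout. A minimal sketch of building those two views (illustrative only, not CogAgent's implementation):

```python
import numpy as np

def dual_resolution_views(image, factor=5):
    """Return the two inputs a dual-encoder VLM might consume:
    the original high-resolution image (e.g. 1120x1120) and a
    block-averaged low-resolution copy (e.g. 224x224)."""
    h, w, c = image.shape
    assert h % factor == 0 and w % factor == 0
    low = image.reshape(h // factor, factor,
                        w // factor, factor, c).mean(axis=(1, 3))
    return image, low

# A 1120x1120 screenshot yields a 224x224 low-resolution view
screenshot = np.ones((1120, 1120, 3))
high, low = dual_resolution_views(screenshot, factor=5)
```

Feeding the cheap low-resolution view to the main encoder while a lighter module attends to the high-resolution view keeps compute manageable at large input sizes.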
arXiv Detail & Related papers (2023-12-14T13:20:57Z) - Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing [17.24045904273874]
We propose DroidAgent, an autonomous GUI testing agent for Android.
It is based on Large Language Models and supporting mechanisms such as long- and short-term memory.
DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques.
arXiv Detail & Related papers (2023-11-15T01:59:40Z) - GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [167.6232690168905]
MM-Navigator is a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task.
MM-Navigator can interact with a smartphone screen as human users do, determining subsequent actions to fulfill given instructions.
arXiv Detail & Related papers (2023-11-13T18:53:37Z) - NiCro: Purely Vision-based, Non-intrusive Cross-Device and Cross-Platform GUI Testing [19.462053492572142]
We propose a non-intrusive cross-device and cross-platform system NiCro.
NiCro uses the state-of-the-art GUI widget detector to detect widgets from GUI images and then analyses a set of comprehensive information to match the widgets across diverse devices.
At the system level, NiCro can interact with a virtual device farm and a robotic arm system to perform cross-device, cross-platform testing non-intrusively.
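The matching step described above pairs each detected widget on one device with its counterpart on another. A simple greedy matcher over widget feature vectors conveys the idea (a hypothetical sketch; NiCro's actual matcher combines visual, textual, and layout cues rather than a single embedding):

```python
import numpy as np

def match_widgets(feats_a, feats_b, threshold=0.8):
    """Greedily pair widgets from device A with widgets from device B
    by cosine similarity of their feature vectors. Returns a list of
    (index_in_a, index_in_b) pairs; each B widget is used at most once."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T  # pairwise cosine similarities
    pairs, used = [], set()
    for i in np.argsort(-sim.max(axis=1)):  # most confident widgets first
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= threshold and j not in used:
            pairs.append((int(i), j))
            used.add(j)
    return pairs

# Example: three widgets whose features reappear in a different order
feats_a = np.eye(3)
feats_b = np.eye(3)[[2, 0, 1]]
matches = match_widgets(feats_a, feats_b)
```

The threshold lets unmatched widgets (e.g. device-specific UI elements) remain unpaired instead of being forced into a wrong match.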
arXiv Detail & Related papers (2023-05-24T01:19:05Z) - Towards Versatile Embodied Navigation [120.73460380993305]
Vienna is a versatile embodied navigation agent that simultaneously learns to perform four navigation tasks with a single model.
We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
arXiv Detail & Related papers (2022-10-30T11:53:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.