Related papers: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

URL: http://arxiv.org/abs/2406.08451v1
Date: Wed, 12 Jun 2024 17:44:26 GMT
Title: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Authors: Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo
Abstract summary: GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. We developed OdysseyAgent, a multimodal cross-app navigation agent by fine-tuning the Qwen-VL model with a history resampling module.
Score: 61.48043339441149
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets comprising simple tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we introduce GUI Odyssey, a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. Leveraging GUI Odyssey, we developed OdysseyAgent, a multimodal cross-app navigation agent, by fine-tuning the Qwen-VL model with a history resampling module. Extensive experiments demonstrate OdysseyAgent's superior accuracy compared to existing models. For instance, OdysseyAgent surpasses fine-tuned Qwen-VL and zero-shot GPT-4V by 1.44% and 55.49% in in-domain accuracy, and by 2.29% and 48.14% in out-of-domain accuracy, on average. The dataset and code will be released at https://github.com/OpenGVLab/GUI-Odyssey.

Related papers

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents [57.59830804627066]
We introduce MONDAY, a large-scale dataset of 313K annotated frames from 20K instructional videos capturing real-world mobile OS navigation. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities. We present an automated framework that leverages publicly available video content to create comprehensive task datasets.
arXiv Detail & Related papers (2025-05-19T02:39:03Z)
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.06743063375121]
We propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. We produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks.
arXiv Detail & Related papers (2025-04-17T06:15:56Z)
Breaking the Data Barrier -- Building GUI Agents Through Task Generalization [25.129269032612832]
We propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage. We explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges.
arXiv Detail & Related papers (2025-04-14T11:35:02Z)
Explorer: Robust Collection of Interactable GUI Elements [0.0]
We show how a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, where the training process has access to a live version of the application. Explorer also enables the recording of interactive user sessions, and the subsequent mapping of how such sessions overlap and sometimes loop back to similar states.
arXiv Detail & Related papers (2025-04-12T22:02:29Z)
MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions [45.7564684180131]
Mobile phone agents can assist people in automating daily tasks on their phones. Existing procedure-oriented agents struggle with cross-app instructions. We propose a self-evolving multi-agent framework named MobileSteward.
arXiv Detail & Related papers (2025-02-24T03:12:45Z)
UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience. Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments. Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms. We develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop ShowUI, a vision-language-action model for the digital world. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents. MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders. It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
arXiv Detail & Related papers (2024-07-05T08:37:10Z)
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios. AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels. AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
CogAgent: A Visual Language Model for GUI Agents [61.26491779502794]
We introduce CogAgent, a visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120*1120. CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE.
arXiv Detail & Related papers (2023-12-14T13:20:57Z)
Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing [17.24045904273874]
We propose DroidAgent, an autonomous GUI testing agent for Android. It is based on Large Language Models and support mechanisms such as long- and short-term memory. DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques.
arXiv Detail & Related papers (2023-11-15T01:59:40Z)
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [167.6232690168905]
MM-Navigator is a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do, and determine subsequent actions to fulfill given instructions.
arXiv Detail & Related papers (2023-11-13T18:53:37Z)
NiCro: Purely Vision-based, Non-intrusive Cross-Device and Cross-Platform GUI Testing [19.462053492572142]
We propose a non-intrusive cross-device and cross-platform system NiCro. NiCro uses the state-of-the-art GUI widget detector to detect widgets from GUI images and then analyses a set of comprehensive information to match the widgets across diverse devices. At the system level, NiCro can interact with a virtual device farm and a robotic arm system to perform cross-device, cross-platform testing non-intrusively.
arXiv Detail & Related papers (2023-05-24T01:19:05Z)
Towards Versatile Embodied Navigation [120.73460380993305]
Vienna is a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model. We empirically demonstrate that, compared with learning each visual navigation task individually, our agent achieves comparable or even better performance with reduced complexity.
arXiv Detail & Related papers (2022-10-30T11:53:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.