MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- URL: http://arxiv.org/abs/2406.08184v1
- Date: Wed, 12 Jun 2024 13:14:50 GMT
- Title: MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
- Authors: Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen,
- Abstract summary: Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs)
Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents.
We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
- Score: 7.4568642040547894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) and their potential to autonomously manage daily tasks. Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents, due to the inexhaustible states of apps and the vague definition of feasible action sequences. To address this challenge, we propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing. We initially define 100 tasks across 10 open-source apps, categorized by multiple levels of difficulty. Subsequently, we evaluate several existing mobile agents, including AppAgent and MobileAgent, to thoroughly and systematically compare their performance. All materials are accessible on our project webpage: https://MobileAgentBench.github.io, contributing to the advancement of both academic and industrial fields.
Related papers
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z) - CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [52.83132876539399]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.
arXiv Detail & Related papers (2024-07-01T17:55:04Z) - Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [46.81304373693033]
Large language models (LLMs) have become a research hotspot in human-computer interaction.
Mobile-Bench is a novel benchmark for evaluating the capabilities of LLM-based mobile agents.
arXiv Detail & Related papers (2024-07-01T06:10:01Z) - Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z) - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z) - Benchmarking Mobile Device Control Agents across Diverse Configurations [21.164023091324523]
B-MoCA is a novel benchmark for evaluating mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations.
arXiv Detail & Related papers (2024-04-25T14:56:32Z) - Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [52.5831204440714]
We introduce Mobile-Agent, an autonomous multi-modal mobile device agent.
Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface.
It then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step.
arXiv Detail & Related papers (2024-01-29T13:46:37Z) - MobileAgent: enhancing mobile control via human-machine interaction and
SOP integration [0.0]
Large Language Models (LLMs) are now capable of automating mobile device operations for users.
Privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation.
We have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs.
Our approach is evaluated on the new device control benchmark AitW, which encompasses 30K unique instructions across multi-step tasks.
arXiv Detail & Related papers (2024-01-04T03:44:42Z) - AppAgent: Multimodal Agents as Smartphone Users [23.318925173980446]
Our framework enables the agent to operate smartphone applications through a simplified action space.
The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations.
To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications.
arXiv Detail & Related papers (2023-12-21T11:52:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.