Related papers: MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

URL: http://arxiv.org/abs/2406.08184v1
Date: Wed, 12 Jun 2024 13:14:50 GMT
Title: MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents
Authors: Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen,
Abstract summary: Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents. We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
Score: 7.4568642040547894
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphic User Interfaces (GUIs) and their potential to autonomously manage daily tasks. Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents, due to the inexhaustible states of apps and the vague definition of feasible action sequences. To address this challenge, we propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing. We initially define 100 tasks across 10 open-source apps, categorized by multiple levels of difficulty. Subsequently, we evaluate several existing mobile agents, including AppAgent and MobileAgent, to thoroughly and systematically compare their performance. All materials are accessible on our project webpage: https://MobileAgentBench.github.io, contributing to the advancement of both academic and industrial fields.

Related papers

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.57043903478257]
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations.<n>With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality.<n>This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development.
arXiv Detail & Related papers (2025-08-06T14:33:45Z)
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration [53.54951412651823]
Mobile-Agent-V is a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V integrates a sliding window strategy and incorporates a video agent and deep-reflection agent to ensure that actions align with user instructions. Results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks.
arXiv Detail & Related papers (2025-02-24T12:51:23Z)
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience. Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
Foundations and Recent Trends in Multimodal Mobile Agents: A Survey [57.677161006710065]
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. Recent advancements enhance real-time adaptability and multimodal interaction. We categorize these advancements into two main approaches: prompt-based methods and training-based methods.
arXiv Detail & Related papers (2024-11-04T11:50:58Z)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents. SPA-Bench offers three key contributions: A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines. A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments. VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
arXiv Detail & Related papers (2024-08-12T17:44:17Z)
AppAgent v2: Advanced Agent for Flexible Mobile Interactions [46.789563920416626]
This work introduces a novel LLM-based multimodal agent framework for mobile devices. Our agent constructs a flexible action space that enhances adaptability across various applications. Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents. We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs. While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [52.5831204440714]
We introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. It then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step.
arXiv Detail & Related papers (2024-01-29T13:46:37Z)
AppAgent: Multimodal Agents as Smartphone Users [23.318925173980446]
Our framework enables the agent to operate smartphone applications through a simplified action space. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications.
arXiv Detail & Related papers (2023-12-21T11:52:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.