Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- URL: http://arxiv.org/abs/2501.11733v2
- Date: Tue, 28 Jan 2025 16:58:02 GMT
- Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
- Abstract summary: Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience.
Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
- Score: 85.48034185086169
- Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
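To make the division of labor described above concrete, here is a minimal sketch of the hierarchical loop: the Manager plans at the subgoal level while the four subordinate agents perceive, act, verify, and aggregate. All class and method names below (Manager, Perceptor, Operator, ActionReflector, Notetaker, and their methods) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of Mobile-Agent-E's hierarchical control loop.
# Interfaces are assumptions inferred from the abstract, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class TaskState:
    subgoals: list[str] = field(default_factory=list)   # Manager's high-level plan
    notes: list[str] = field(default_factory=list)      # Notetaker's aggregated info
    history: list[tuple[str, bool]] = field(default_factory=list)  # (action, verified_ok)

def run_task(task, manager, perceptor, operator, reflector, notetaker):
    """One episode: high-level planning is explicitly separated from
    low-level perception, action, verification, and note-taking."""
    state = TaskState(subgoals=manager.plan(task))       # break task into subgoals
    while state.subgoals:                                # sketch: assumes the loop terminates
        subgoal = state.subgoals[0]
        screen = perceptor.perceive()                    # fine-grained visual perception
        action = operator.decide(subgoal, screen, state.notes)  # immediate action choice
        operator.execute(action)
        ok = reflector.verify(subgoal, screen, perceptor.perceive())  # error verification
        state.history.append((action, ok))
        state.notes += notetaker.aggregate(screen, action)  # information aggregation
        if ok and reflector.subgoal_done(subgoal, state):
            state.subgoals.pop(0)                        # advance to the next subgoal
        elif not ok:
            state.subgoals = manager.replan(task, state) # Manager revises the plan on failure
    return state
```

The key design choice the abstract emphasizes is that only the Manager reasons over the full task, while each subordinate agent sees a narrow slice of the problem, which keeps long-horizon reasoning tractable.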
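The self-evolution module is described as a persistent long-term memory of Tips (natural-language lessons) and Shortcuts (reusable, executable sequences of atomic operations). A plausible minimal representation of that memory, offered purely as an assumption about the data structures involved:

```python
# Hypothetical sketch of the persistent Tips/Shortcuts long-term memory.
# Field names, the file format, and the update step are assumptions,
# not the paper's actual implementation.
import json
from dataclasses import dataclass, asdict

@dataclass
class Shortcut:
    name: str              # e.g. "search_in_app" (illustrative)
    arguments: list[str]   # parameters the macro takes, e.g. ["app", "query"]
    operations: list[dict] # ordered atomic operations, e.g. [{"op": "tap", "target": "..."}]

class LongTermMemory:
    def __init__(self, path: str = "memory.json"):
        self.path = path
        self.tips: list[str] = []           # general guidance from prior tasks
        self.shortcuts: list[Shortcut] = [] # reusable subroutines

    def evolve(self, new_tips: list[str], new_shortcuts: list[Shortcut]) -> None:
        """After each task, merge newly distilled Tips and Shortcuts so that
        later tasks start from the refined memory (continuous refinement)."""
        self.tips.extend(t for t in new_tips if t not in self.tips)
        self.shortcuts.extend(new_shortcuts)
        with open(self.path, "w") as f:
            json.dump({"tips": self.tips,
                       "shortcuts": [asdict(s) for s in self.shortcuts]},
                      f, indent=2)
```

Under this reading, Tips improve decision quality (they are injected into prompts) while Shortcuts improve efficiency (a multi-step subroutine collapses into a single invocable action), which is consistent with the abstract's claim of gains in both performance and efficiency.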
Related papers
- Foundations and Recent Trends in Multimodal Mobile Agents: A Survey [57.677161006710065]
Mobile agents are essential for automating tasks in complex and dynamic mobile environments.
Recent advancements enhance real-time adaptability and multimodal interaction.
We categorize these advancements into two main approaches: prompt-based methods and training-based methods.
arXiv Detail & Related papers (2024-11-04T11:50:58Z)
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
SPA-Bench offers three key contributions, including a diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines.
It also provides a novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
- MobA: A Two-Level Agent System for Efficient Mobile Task Automation [22.844404052755294]
MobA is a novel Mobile phone Agent powered by multimodal large language models.
The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks.
The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA.
arXiv Detail & Related papers (2024-10-17T16:53:50Z)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions [46.789563920416626]
This work introduces a novel LLM-based multimodal agent framework for mobile devices.
Our agent constructs a flexible action space that enhances adaptability across various applications.
Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
- MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices [17.702068044185086]
This paper introduces MobileExperts, the first framework to combine tool formulation with multi-agent collaboration on mobile devices.
We develop a dual-layer planning mechanism to establish coordinated collaboration among experts.
Experimental results demonstrate that MobileExperts performs better across all intelligence levels and achieves a 22% reduction in reasoning costs.
arXiv Detail & Related papers (2024-07-04T13:12:19Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [52.5831204440714]
We introduce Mobile-Agent, an autonomous multi-modal mobile device agent.
Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface.
It then autonomously plans and decomposes the complex operation task, and navigates through the mobile apps step by step.
arXiv Detail & Related papers (2024-01-29T13:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.