Related papers: Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

URL: http://arxiv.org/abs/2406.01014v1
Date: Mon, 3 Jun 2024 05:50:00 GMT
Title: Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Authors: Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang,
Abstract summary: We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
Score: 52.25473993987409
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

Related papers

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis [70.39500621448383]
Open-world mobile manipulation task remains a challenge due to the need for generalization to open-ended instructions and environments.<n>We propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling.<n>We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
arXiv Detail & Related papers (2025-06-04T17:57:44Z)
EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation [36.08217588070538]
Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost.<n>We propose textbfEcoAgent, an textbfEdge-textbfCloud ctextbfOllaborative multi-agent framework for mobile automation.<n>EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes.
arXiv Detail & Related papers (2025-05-08T17:31:20Z)
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration [53.54951412651823]
Mobile-Agent-V is a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V integrates a sliding window strategy and incorporates a video agent and deep-reflection agent to ensure that actions align with user instructions. Results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks.
arXiv Detail & Related papers (2025-02-24T12:51:23Z)
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z)
ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation [11.931584529573176]
Given a task, mobile AI agents can interact with mobile devices in multiple steps and form a GUI flow that solves the task. To address this issue, we constructed a training dataset called MobileReach, which breaks the task into page reaching and operation subtasks. We propose ReachAgent, a two-stage framework that focuses on improving its task-completion abilities.
arXiv Detail & Related papers (2025-02-05T07:35:23Z)
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience. Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
Foundations and Recent Trends in Multimodal Mobile Agents: A Survey [57.677161006710065]
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. Recent advancements enhance real-time adaptability and multimodal interaction. We categorize these advancements into two main approaches: prompt-based methods and training-based methods.
arXiv Detail & Related papers (2024-11-04T11:50:58Z)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents. SPA-Bench offers three key contributions: A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines. A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
MobA: A Two-Level Agent System for Efficient Mobile Task Automation [22.844404052755294]
MobA is a novel Mobile phone Agent powered by multimodal large language models. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA.
arXiv Detail & Related papers (2024-10-17T16:53:50Z)
AppAgent v2: Advanced Agent for Flexible Mobile Interactions [46.789563920416626]
This work introduces a novel LLM-based multimodal agent framework for mobile devices. Our agent constructs a flexible action space that enhances adaptability across various applications. Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices [17.702068044185086]
This paper introduces MobileExperts, which for the first time introduces tool formulation and multi-agent collaboration. We develop a dual-layer planning mechanism to establish coordinate collaboration among experts. Experimental results demonstrate that MobileExperts performs better on all intelligence levels and achieves 22% reduction in reasoning costs.
arXiv Detail & Related papers (2024-07-04T13:12:19Z)
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [52.5831204440714]
We introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. It then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step.
arXiv Detail & Related papers (2024-01-29T13:46:37Z)
MobileAgent: enhancing mobile control via human-machine interaction and SOP integration [0.0]
Large Language Models (LLMs) are now capable of automating mobile device operations for users. Privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation. We have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs. Our approach is evaluated on the new device control benchmark AitW, which encompasses 30K unique instructions across multi-step tasks.
arXiv Detail & Related papers (2024-01-04T03:44:42Z)
Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation [54.31414116478024]
In mobile manipulation (MM), robots can both navigate within and interact with their environment. In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks.
arXiv Detail & Related papers (2021-12-09T23:54:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.