Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- URL: http://arxiv.org/abs/2401.16158v2
- Date: Thu, 18 Apr 2024 06:53:38 GMT
- Title: Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
- Authors: Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
- Abstract summary: We introduce Mobile-Agent, an autonomous multi-modal mobile device agent.
Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface.
It then autonomously plans and decomposes complex operation tasks, and navigates the mobile apps through operations step by step.
- Score: 52.5831204440714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes complex operation tasks, and navigates the mobile apps through operations step by step. Different from previous solutions that rely on XML files of apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduce Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
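As a rough illustration of the vision-centric loop the abstract describes (capture the screen, localize visual and textual elements, let the MLLM pick one operation, execute it, repeat), here is a minimal Python sketch. It is not the authors' implementation; every function name below is a hypothetical placeholder standing in for the real perception tools and model calls.

```python
# Minimal sketch of a perceive -> decide -> act loop for a mobile GUI agent.
# All functions are hypothetical placeholders, not the Mobile-Agent code.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UIElement:
    """One localized on-screen element (OCR text or detected icon label)."""
    label: str
    center: Tuple[int, int]  # tap coordinate on the screenshot


def capture_screenshot() -> bytes:
    """Placeholder for grabbing the current device screen (e.g. via adb)."""
    return b""


def detect_elements(screenshot: bytes) -> List[UIElement]:
    """Placeholder for the visual perception step: text recognition plus
    icon detection, returning elements with their positions."""
    return [UIElement(label="Search", center=(540, 120))]


def choose_operation(instruction: str, elements: List[UIElement],
                     history: List[str]) -> Optional[str]:
    """Placeholder for the MLLM call that decomposes the task and picks
    exactly one operation for this step; None means the task looks done."""
    if history:  # dummy logic: stop after a single step
        return None
    return f"tap '{elements[0].label}' at {elements[0].center}"


def execute(operation: str) -> None:
    """Placeholder for sending the chosen operation to the device."""
    print("executing:", operation)


def run_agent(instruction: str, max_steps: int = 10) -> List[str]:
    """Repeat perceive -> decide -> act until completion or the step limit."""
    history: List[str] = []
    for _ in range(max_steps):
        elements = detect_elements(capture_screenshot())
        operation = choose_operation(instruction, elements, history)
        if operation is None:
            break
        execute(operation)
        history.append(operation)
    return history


if __name__ == "__main__":
    print(run_agent("Search for today's weather"))
```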
Related papers
- Very Large-Scale Multi-Agent Simulation in AgentScope [115.83581238212611]
We develop new features and components for AgentScope, a user-friendly multi-agent platform.
We propose an actor-based distributed mechanism towards great scalability and high efficiency.
We provide a web-based interface for conveniently monitoring and managing a large number of agents.
arXiv Detail & Related papers (2024-07-25T05:50:46Z)
- MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices [17.702068044185086]
This paper presents MobileExperts, which for the first time introduces tool formulation and multi-agent collaboration.
We develop a dual-layer planning mechanism to establish coordinated collaboration among experts.
Experimental results demonstrate that MobileExperts performs better on all intelligence levels and achieves a 22% reduction in reasoning costs.
arXiv Detail & Related papers (2024-07-04T13:12:19Z)
- Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [46.81304373693033]
Large language models (LLMs) have become a research hotspot in human-computer interaction.
Mobile-Bench is a novel benchmark for evaluating the capabilities of LLM-based mobile agents.
arXiv Detail & Related papers (2024-07-01T06:10:01Z)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents [7.4568642040547894]
Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphical User Interfaces (GUIs).
Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents.
We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
arXiv Detail & Related papers (2024-06-12T13:14:50Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
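A minimal Python sketch of how such a three-role collaboration (planning, decision, reflection) could be wired together follows; it assumes hypothetical placeholder functions for each role rather than the paper's actual prompts or models.

```python
# Toy three-agent loop in the spirit of planning / decision / reflection roles.
# All role functions are hypothetical placeholders, not Mobile-Agent-v2 code.
from typing import List


def planning_agent(instruction: str, history: List[str]) -> str:
    """Placeholder: condense the operation history into a short task-progress
    note so the decision agent does not need the full trajectory."""
    return f"{len(history)} operations completed toward: {instruction}"


def decision_agent(progress: str, screen_before: str) -> str:
    """Placeholder: pick the next on-screen operation from the progress
    note and the current screen state."""
    return f"tap an element on the '{screen_before}' screen"


def reflection_agent(screen_before: str, screen_after: str, operation: str) -> bool:
    """Placeholder: judge whether the operation changed the screen as intended;
    an unsuccessful operation would be retried or revised."""
    return screen_before != screen_after


def run_v2(instruction: str, screens: List[str]) -> List[str]:
    """Drive the planning/decision/reflection roles over a toy screen sequence."""
    history: List[str] = []
    for before, after in zip(screens, screens[1:]):
        progress = planning_agent(instruction, history)
        operation = decision_agent(progress, before)
        if reflection_agent(before, after, operation):
            history.append(operation)  # keep only operations judged effective
    return history


if __name__ == "__main__":
    print(run_v2("Reply to the latest message", ["inbox", "compose", "sent"]))
```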
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
- Benchmarking Mobile Device Control Agents across Diverse Configurations [21.164023091324523]
B-MoCA is a novel benchmark for evaluating mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs as well as agents trained from scratch using human expert demonstrations.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation [59.21899709023333]
We develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control.
Mobile ALOHA is a low-cost and whole-body teleoperation system for data collection.
Co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks.
arXiv Detail & Related papers (2024-01-04T07:55:53Z)
- MobileAgent: enhancing mobile control via human-machine interaction and SOP integration [0.0]
Large Language Models (LLMs) are now capable of automating mobile device operations for users.
Privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation.
We have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs.
Our approach is evaluated on the new device control benchmark AitW, which encompasses 30K unique instructions across multi-step tasks.
arXiv Detail & Related papers (2024-01-04T03:44:42Z)