Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation
- URL: http://arxiv.org/abs/2511.12254v1
- Date: Sat, 15 Nov 2025 15:22:42 GMT
- Title: Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation
- Authors: Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li,
- Abstract summary: Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks.<n>We propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation.
- Score: 57.12284831164602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
Related papers
- K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control [73.50217471850658]
K2-Agent is a hierarchical framework that models human-like cognition by knowing and co-evolving declarative (what) and procedural (how) knowledge for planning and execution.<n>On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw and open-source backbones.
arXiv Detail & Related papers (2026-02-28T14:33:14Z) - AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow.<n>We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks.<n>We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z) - Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control [72.43808515668947]
We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control.<n>Hi-Agent features a high-level reasoning model and a low-level action model that are jointly optimized.<n>Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark.
arXiv Detail & Related papers (2025-10-16T07:38:21Z) - CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning [18.826366389246385]
We propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP)<n>Our approach overcomes the VLM's deficiency in GUI scenarios planning by using human-planned subtasks as the basis vector.<n>We evaluate our architecture in both English and Chinese contexts across 20 Apps, demonstrating significant improvements in both effectiveness and efficiency.
arXiv Detail & Related papers (2025-03-05T18:56:16Z) - Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience.<n>Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z) - MobileExperts: A Dynamic Tool-Enabled Agent Team in Mobile Devices [17.702068044185086]
This paper introduces MobileExperts, which for the first time introduces tool formulation and multi-agent collaboration.
We develop a dual-layer planning mechanism to establish coordinate collaboration among experts.
Experimental results demonstrate that MobileExperts performs better on all intelligence levels and achieves 22% reduction in reasoning costs.
arXiv Detail & Related papers (2024-07-04T13:12:19Z) - Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z) - RL-GPT: Integrating Reinforcement Learning and Code-as-policy [82.1804241891039]
We introduce a two-level hierarchical framework, RL-GPT, comprising a slow agent and a fast agent.
The slow agent analyzes actions suitable for coding, while the fast agent executes coding tasks.
This decomposition effectively focuses each agent on specific tasks, proving highly efficient within our pipeline.
arXiv Detail & Related papers (2024-02-29T16:07:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.