Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent
- URL: http://arxiv.org/abs/2509.20729v1
- Date: Thu, 25 Sep 2025 04:21:31 GMT
- Title: Fairy: Interactive Mobile Assistant to Real-world Tasks via LMM-based Multi-agent
- Authors: Jiazheng Sun, Te Yang, Jiayang Niu, Mingxuan Li, Yongyong Lu, Ruimeng Yang, Xin Peng,
- Abstract summary: Fairy is an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. It comprises three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory; and (iii) a Self-Learner that consolidates execution experience into an App Map and Tricks.
- Score: 7.715715198300168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multi-modal models (LMMs) have advanced mobile GUI agents. However, existing methods struggle with real-world scenarios involving diverse app interfaces and evolving user needs. End-to-end methods relying on the model's commonsense often fail on long-tail apps, and agents that act without user interaction behave unilaterally, harming user experience. To address these limitations, we propose Fairy, an interactive multi-agent mobile assistant capable of continuously accumulating app knowledge and self-evolving during usage. Fairy enables cross-app collaboration, interactive execution, and continual learning through three core modules: (i) a Global Task Planner that decomposes user tasks into sub-tasks from a cross-app view; (ii) an App-Level Executor that refines sub-tasks into steps and actions based on long- and short-term memory, achieving precise execution and user interaction via four core agents operating in dual loops; and (iii) a Self-Learner that consolidates execution experience into an App Map and Tricks. To evaluate Fairy, we introduce RealMobile-Eval, a real-world benchmark with a comprehensive metric suite, and LMM-based agents for automated scoring. Experiments show that Fairy with a GPT-4o backbone outperforms the previous SoTA by improving user requirement completion by 33.7% and reducing redundant steps by 58.5%, showing the effectiveness of its interaction and self-learning.
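To make the planner / executor / learner split concrete, below is a minimal Python sketch of how the three modules described in the abstract could fit together. Every class name, method signature, and data field is an assumption inferred from the abstract alone (the paper's actual interfaces are not part of this listing), so treat it as an illustrative skeleton rather than Fairy's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    # Long-term memory: accumulated app knowledge (App Map and Tricks).
    app_map: dict = field(default_factory=dict)
    tricks: list = field(default_factory=list)
    # Short-term memory: the trace of the task currently being executed.
    history: list = field(default_factory=list)


class GlobalTaskPlanner:
    def plan(self, user_task: str) -> list[str]:
        # Decompose the user task into sub-tasks from a cross-app view.
        # A real planner would prompt an LMM; this stub just echoes the task.
        return [f"sub-task for: {user_task}"]


class AppLevelExecutor:
    def __init__(self, memory: Memory):
        self.memory = memory

    def execute(self, sub_task: str) -> None:
        # The abstract describes dual loops (sub-task -> steps -> actions,
        # with user interaction); they are compressed into one loop here.
        for step in self._refine(sub_task):
            self.memory.history.append(step)

    def _refine(self, sub_task: str) -> list[str]:
        return [f"step for: {sub_task}"]


class SelfLearner:
    def consolidate(self, memory: Memory) -> None:
        # Fold the execution trace back into long-term memory so the
        # assistant accumulates app knowledge across uses.
        memory.tricks.append({"trace": list(memory.history)})
        memory.history.clear()


def run(user_task: str) -> Memory:
    memory = Memory()
    planner, executor, learner = GlobalTaskPlanner(), AppLevelExecutor(memory), SelfLearner()
    for sub_task in planner.plan(user_task):
        executor.execute(sub_task)
    learner.consolidate(memory)
    return memory
```

The four core agents the abstract attributes to the App-Level Executor are not modelled individually here; a faithful version would split `_refine` into separate LMM-backed agents rather than a single stub.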
Related papers
- AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent [49.61420186190895]
This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant. AppCopilot operates across applications and constitutes a full-stack, closed-loop system from data to deployment.
arXiv Detail & Related papers (2025-09-02T15:48:21Z)
- MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions [11.021990614727702]
We present MVISU-Bench, a benchmark that includes 404 tasks across 137 mobile applications. We also propose Aider, a plug-and-play module that acts as a dynamic prompter to mitigate risks and clarify user intent for mobile agents.
arXiv Detail & Related papers (2025-08-12T16:18:30Z)
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience. Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
- Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task. We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z)
- Foundations and Recent Trends in Multimodal Mobile Agents: A Survey [72.29426995154088]
Mobile agents are essential for automating tasks in complex and dynamic mobile environments. Recent advancements enhance real-time adaptability and multimodal interaction. We categorize these advancements into two main approaches: prompt-based methods and training-based methods.
arXiv Detail & Related papers (2024-11-04T11:50:58Z)
- AppAgent v2: Advanced Agent for Flexible Mobile Interactions [57.98933460388985]
This work introduces a novel LLM-based multimodal agent framework for mobile devices. Our agent constructs a flexible action space that enhances adaptability across various applications. Our results demonstrate the framework's superior performance, confirming its effectiveness in real-world scenarios.
arXiv Detail & Related papers (2024-08-05T06:31:39Z)
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices [47.98821056800437]
We present GUIOdyssey, a dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. We develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module.
arXiv Detail & Related papers (2024-06-12T17:44:26Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
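The Mobile-Agent-v2 entry above names its three agent roles but not how they interact, so here is a rough sketch, under assumed interfaces, of how a planning / decision / reflection loop is typically wired together. None of the function names, signatures, or the stopping rule below come from the paper.

```python
from typing import Callable

def run_task(task: str,
             plan: Callable[[str, list[str]], str],      # planning agent: summarize task progress
             decide: Callable[[str], str],               # decision agent: choose the next UI action
             reflect: Callable[[str, str], str],         # reflection agent: check the action's outcome
             max_steps: int = 25) -> list[str]:
    """Run one task through an assumed three-agent loop and return the action trace."""
    trace: list[str] = []
    for _ in range(max_steps):
        progress = plan(task, trace)
        if progress == "finished":
            break
        action = decide(progress)
        feedback = reflect(progress, action)
        trace.append(f"{action} -> {feedback}")
    return trace
```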
- MobileAgent: enhancing mobile control via human-machine interaction and SOP integration [0.0]
Large Language Models (LLMs) are now capable of automating mobile device operations for users.
Privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation.
We have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs.
Our approach is evaluated on the new device control benchmark AitW, which encompasses 30K unique instructions across multi-step tasks.
arXiv Detail & Related papers (2024-01-04T03:44:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated list (including any of the information it contains) and is not responsible for any consequences arising from its use.