Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
- URL: http://arxiv.org/abs/2503.15937v2
- Date: Fri, 21 Mar 2025 03:19:57 GMT
- Title: Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
- Authors: Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu
- Abstract summary: V-Droid is a mobile task automation agent that employs Large Language Models as verifiers. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks. V-Droid achieves an impressively low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making capabilities.
- Score: 14.326779061712404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier's decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves an impressively low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making capabilities.
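To make the verifier-driven paradigm concrete, the sketch below shows a single decision step: enumerate a discretized set of candidate actions from the current screen, score each candidate with an LLM verifier, and execute the highest-scoring one. This is only an illustration under assumed interfaces, not the authors' implementation: `CandidateAction`, `select_action`, the `verify` callable, and the Yes/No progress prompt are hypothetical names and formats standing in for the prompts, scoring, and batched prefilling described in the paper.

```python
# Minimal sketch of a verifier-driven decision step (illustrative, not V-Droid's code).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateAction:
    """One entry of the discretized action space extracted from the UI tree."""
    description: str  # e.g. 'click the "Settings" button'
    command: str      # e.g. 'tap(412, 980)'


def select_action(
    task: str,
    ui_state: str,
    candidates: List[CandidateAction],
    verify: Callable[[str], float],
) -> CandidateAction:
    """Score every candidate with the verifier and return the best one.

    `verify` is assumed to return a scalar score for a prompt, e.g. the
    probability mass an LLM assigns to "Yes" for the question
    "Does this action make progress toward the task?".
    """
    best, best_score = None, float("-inf")
    for cand in candidates:
        prompt = (
            f"Task: {task}\n"
            f"Current screen: {ui_state}\n"
            f"Proposed action: {cand.description}\n"
            "Does this action make progress toward completing the task? "
            "Answer Yes or No."
        )
        score = verify(prompt)  # one scoring query per candidate
        if score > best_score:
            best, best_score = cand, score
    return best


# Usage with a trivial stand-in verifier (a real agent would query an LLM here).
if __name__ == "__main__":
    candidates = [
        CandidateAction('click the "Wi-Fi" menu item', "tap(120, 300)"),
        CandidateAction("press the back button", "back()"),
    ]
    dummy_verify = lambda prompt: float("Wi-Fi" in prompt)
    chosen = select_action("Turn on Wi-Fi", "Settings home screen", candidates, dummy_verify)
    print(chosen.command)  # -> tap(120, 300)
```

Because each scoring query only asks for a Yes/No judgment, the abstract's prefilling-only workflow can evaluate candidates without autoregressive decoding, which is what the paper credits for the 0.7-second per-step latency.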
Related papers
- Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration [53.54951412651823]
Mobile-Agent-V is a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation.
Mobile-Agent-V integrates a sliding window strategy and incorporates a video agent and deep-reflection agent to ensure that actions align with user instructions.
Results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks.
arXiv Detail & Related papers (2025-02-24T12:51:23Z)
- ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation [11.931584529573176]
Given a task, mobile AI agents can interact with mobile devices in multiple steps and form a GUI flow that solves the task.
To address this issue, we constructed a training dataset called MobileReach, which breaks the task into page reaching and operation subtasks.
We propose ReachAgent, a two-stage framework that focuses on improving its task-completion abilities.
arXiv Detail & Related papers (2025-02-05T07:35:23Z)
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience.
Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
- A3: Android Agent Arena for Mobile GUI Agents [46.73085454978007]
Mobile GUI agents are designed to autonomously perform tasks on mobile devices.
Android Agent Arena (A3) is a novel evaluation platform for assessing performance on real-world, in-the-wild tasks.
A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios.
arXiv Detail & Related papers (2025-01-02T09:03:56Z)
- Foundations and Recent Trends in Multimodal Mobile Agents: A Survey [57.677161006710065]
Mobile agents are essential for automating tasks in complex and dynamic mobile environments.
Recent advancements enhance real-time adaptability and multimodal interaction.
We categorize these advancements into two main approaches: prompt-based methods and training-based methods.
arXiv Detail & Related papers (2024-11-04T11:50:58Z)
- AutoGLM: Autonomous Foundation Agents for GUIs [51.276965515952]
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs).
We have developed AutoGLM as a practical foundation agent system for real-world GUI interactions.
Our evaluations demonstrate AutoGLM's effectiveness across multiple domains.
arXiv Detail & Related papers (2024-10-28T17:05:10Z)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents [7.4568642040547894]
Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone graphical user interfaces (GUIs).
Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents.
We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
arXiv Detail & Related papers (2024-06-12T13:14:50Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: planning agent, decision agent, and reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
- Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception [52.5831204440714]
We introduce Mobile-Agent, an autonomous multi-modal mobile device agent.
Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface.
It then autonomously plans and decomposes the complex operation task, and navigates the mobile apps step by step.
arXiv Detail & Related papers (2024-01-29T13:46:37Z)