AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation
- URL: http://arxiv.org/abs/2412.18116v2
- Date: Thu, 26 Dec 2024 13:52:48 GMT
- Title: AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation
- Authors: Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, Yuanchun Li
- Abstract summary: Large language models (LLMs) have brought exciting new advances to mobile UI agents.
One way to reduce the required model size is to customize a smaller domain-specific model.
We propose to convert the UI task automation problem to a code generation problem.
- Abstract: Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually demand the high reasoning capabilities of powerful large models that are difficult to deploy locally on end-users' devices, which raises serious concerns about user privacy and centralized serving cost. One way to reduce the required model size is to customize a smaller domain-specific model with high-quality training data, e.g., large-scale human demonstrations of diverse types of apps and tasks, but such datasets are extremely difficult to obtain. Inspired by the remarkable coding abilities of recent small language models (SLMs), we propose to convert the UI task automation problem into a code generation problem, which can be effectively solved by an on-device SLM and efficiently executed with an on-device code interpreter. Unlike normal coding tasks, which can be extensively pretrained on public datasets, generating UI automation code is challenging due to the diversity, complexity, and variability of target apps. Therefore, we adopt a document-centered approach that automatically builds fine-grained API documentation for each app and generates diverse task samples based on this documentation. By guiding the agent with the synthetic documents and task samples, it learns to generate precise and efficient scripts to complete unseen tasks. In detailed comparisons with state-of-the-art mobile UI agents, our approach effectively improves mobile task automation, with significantly higher success rates and lower latency/token consumption. Code will be open-sourced.
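To make the core idea concrete, here is a minimal, self-contained sketch of "UI task automation as code generation": a short script written against a small app-specific action API, plus a stub interpreter. All helper names and element ids (open_app, tap, set_text, create_contact_button, ...) are hypothetical illustrations of the kind of interface that could be derived from auto-built app documentation, not the actual AutoDroid-V2 API.

```python
# Hypothetical sketch only: the action API and element ids below are invented
# for illustration; they mimic a per-app interface derived from documentation.

def open_app(package: str) -> None:
    print(f"[interpreter] launching {package}")

def tap(element: str) -> None:
    print(f"[interpreter] tapping '{element}'")

def set_text(element: str, text: str) -> None:
    print(f"[interpreter] typing '{text}' into '{element}'")

def add_contact(name: str, phone: str) -> None:
    """The kind of script an on-device SLM might emit for 'add a new contact'."""
    open_app("com.android.contacts")
    tap("create_contact_button")        # element names would come from the app's API doc
    set_text("name_field", name)
    set_text("phone_field", phone)
    tap("save_button")

if __name__ == "__main__":
    # An on-device interpreter would map each call to real UI events.
    add_contact("Alice", "555-0100")
```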
Related papers
- CAMPHOR: Collaborative Agents for Multi-input Planning and High-Order Reasoning On Device [2.4100803794273005]
We introduce an on-device Small Language Model (SLM) framework designed to handle multiple user inputs and reason over personal context locally.
CAMPHOR employs a hierarchical architecture where a high-order reasoning agent decomposes complex tasks and coordinates expert agents responsible for personal context retrieval, tool interaction, and dynamic plan generation.
By implementing parameter sharing across agents and leveraging prompt compression, we significantly reduce model size, latency, and memory usage.
arXiv Detail & Related papers (2024-10-12T07:28:10Z)
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents [0.0]
ClickAgent is a novel framework for building autonomous agents.
In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model identifies the relevant UI elements on the screen.
Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.
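A rough sketch of the division of labour described above, in which the MLLM decides what to do and a separate UI location model decides where on the screen; plan_next_action and locate_element are hypothetical stand-ins, not ClickAgent's actual API:

```python
# Sketch of the reported split: the MLLM plans the next action in natural
# language, and a separate UI location model grounds it to screen coordinates.
from typing import Tuple

def plan_next_action(task: str, screenshot: bytes) -> str:
    """Stand-in for the MLLM: returns a natural-language action description."""
    return "tap the Search button"

def locate_element(description: str, screenshot: bytes) -> Tuple[int, int]:
    """Stand-in for the UI location model: maps a description to coordinates."""
    return (540, 1820)

def step(task: str, screenshot: bytes) -> None:
    action = plan_next_action(task, screenshot)   # reasoning / action planning
    x, y = locate_element(action, screenshot)     # grounding to pixels
    print(f"tap at ({x}, {y}) for action: {action}")

step("search for hotels in Boston", b"<raw screenshot bytes>")
```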
arXiv Detail & Related papers (2024-10-09T14:49:02Z)
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML [56.565200973244146]
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline.
Recent works have started exploiting large language models (LLMs) to lessen this burden.
This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML.
arXiv Detail & Related papers (2024-10-03T20:01:09Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
We propose an agent that perceives its environment solely through screenshot images.
By leveraging the reasoning capability of Large Language Models, we eliminate the need for large-scale human demonstration data.
The agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
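A toy sketch of this three-role collaboration; the function names and control flow below are illustrative assumptions, not Mobile-Agent-v2's implementation:

```python
# Toy sketch of planning / decision / reflection roles cooperating on a task.

def planning_agent(task: str, history: list) -> str:
    """Summarizes progress so far into a focus for the next step."""
    return f"task '{task}', {len(history)} actions completed"

def decision_agent(focus: str, screen_before: str) -> str:
    """Chooses the concrete UI action to execute next."""
    return "tap('send_button')"

def reflection_agent(screen_before: str, screen_after: str) -> bool:
    """Checks whether the executed action changed the screen as intended."""
    return screen_before != screen_after

def run(task: str, screens: list) -> list:
    history = []
    for before, after in zip(screens, screens[1:]):
        focus = planning_agent(task, history)
        action = decision_agent(focus, before)
        if reflection_agent(before, after):      # keep only actions that took effect
            history.append(action)
    return history

print(run("send a message", ["chat screen", "keyboard open", "message sent"]))
```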
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
- Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API [17.991044940694778]
We build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor.
To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm.
Our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-10-07T07:22:41Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System [85.8338446357469]
We introduce OmniForce, a human-centered AutoML system that yields both human-assisted ML and ML-assisted human techniques.
We show how OmniForce can put an AutoML system into practice and build adaptive AI in open-environment scenarios.
arXiv Detail & Related papers (2023-03-01T13:35:22Z)