AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- URL: http://arxiv.org/abs/2410.24024v2
- Date: Mon, 04 Nov 2024 05:57:31 GMT
- Title: AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
- Authors: Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong
- Abstract summary: We propose AndroidLab as a systematic Android agent framework.
It includes an operation environment with different modalities, an action space, and a reproducible benchmark.
It supports both large language models (LLMs) and large multimodal models (LMMs) in the same action space.
- Score: 32.571194718225996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently emerged as a frequently discussed interaction method. However, existing studies on training and evaluating Android agents lack systematic research across both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark, and it supports both large language models (LLMs) and large multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
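The abstract does not spell out what a shared action space for text-only and multimodal agents looks like. Below is a minimal, hypothetical sketch: the action names and the parser are illustrative assumptions, not AndroidLab's actual interface.

```python
# Hypothetical sketch of a single action space shared by text-mode (XML) and
# multimodal (screenshot) agents; names are illustrative, not AndroidLab's API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    kind: str                                # "tap", "swipe", "type", "back", "home", "finish"
    point: Optional[Tuple[int, int]] = None  # screen coordinates for tap / swipe start
    end: Optional[Tuple[int, int]] = None    # swipe end point
    text: Optional[str] = None               # text to type

def parse_model_output(raw: str) -> Action:
    """Parse a model response such as 'tap(540, 1200)' into a structured Action."""
    name, _, args = raw.strip().partition("(")
    args = args.rstrip(")")
    if name == "tap":
        x, y = (int(v) for v in args.split(","))
        return Action(kind="tap", point=(x, y))
    if name == "type":
        return Action(kind="type", text=args.strip("'\""))
    if name in ("back", "home", "finish"):
        return Action(kind=name)
    raise ValueError(f"unrecognized action: {raw!r}")
```

Because both LLMs and LMMs would emit the same call strings in such a setup, success rates become directly comparable across modalities, which is the point of a unified action space.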
Related papers
- AndroidGen: Building an Android Language Agent under Data Scarcity [32.277219971739726]
We develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity.
We use AndroidGen to collect trajectories for human-specified tasks and train open-source LLMs on them, yielding an open-source mobile agent without manually labeled trajectories.
We extensively evaluate AndroidGen on AndroidWorld, AitW, and various popular applications, demonstrating its gains and revealing potential areas for future improvement.
arXiv Detail & Related papers (2025-04-27T16:30:10Z)
- Reinforcement Learning for Long-Horizon Interactive LLM Agents [56.9860859585028]
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests.
We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments.
We derive LOOP, a data- and memory-efficient variant of proximal policy optimization.
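The summary does not describe LOOP's specific modifications. For orientation only, here is the standard clipped PPO surrogate that any PPO variant starts from; this is a textbook sketch, not LOOP itself.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()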
arXiv Detail & Related papers (2025-02-03T18:35:42Z)
- SlimLM: An Efficient Small Language Model for On-Device Document Assistance [60.971107009492606]
We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices.
SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist.
We evaluate SlimLM against existing SLMs, showing comparable or superior performance.
arXiv Detail & Related papers (2024-11-15T04:44:34Z)
- PhoneLM: An Efficient and Capable Small Language Model Family through Principled Pre-training [6.827011856777674]
Existing small language models (SLMs) for on-device deployment do not take device hardware characteristics into account.
This work presents a simple yet effective principle for SLM design: architecture searching for (near-)optimal runtime efficiency before pre-training.
We develop the PhoneLM family (currently with 0.5B and 1.5B versions), which achieves a state-of-the-art capability-efficiency trade-off among models of similar parameter size.
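In the spirit of that principle, a search might time candidate shapes before committing any pre-training compute. The sketch below uses a crude CPU latency proxy with stand-in models; real work would profile the full architecture on the target device.

```python
import time
import torch
import torch.nn as nn

def build_candidate(hidden: int, layers: int) -> nn.Module:
    # Tiny stand-in for a transformer stack; a real search builds the full model.
    return nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(layers)])

def latency_ms(model: nn.Module, hidden: int, iters: int = 20) -> float:
    x = torch.randn(1, 128, hidden)
    with torch.no_grad():
        model(x)                           # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - t0) / iters * 1e3

# Candidate (hidden, layers) shapes; rank by measured runtime *before* pre-training.
candidates = [(512, 24), (768, 16), (1024, 12)]
best = min(candidates, key=lambda c: latency_ms(build_candidate(*c), c[0]))
print("fastest shape (hidden, layers):", best)
```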
arXiv Detail & Related papers (2024-11-07T02:19:00Z)
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [5.044046039265116]
We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps.
Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language.
Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work.
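As an illustration of "parameterized and expressed in natural language", a task template can be re-instantiated with fresh arguments on every run. The example below is hypothetical, not an actual AndroidWorld task.

```python
import random
import string

TEMPLATE = "Create a contact named {name} with the phone number {phone}."

def sample_task(rng: random.Random) -> str:
    name = rng.choice(string.ascii_uppercase) + "".join(rng.choices(string.ascii_lowercase, k=5))
    phone = "".join(rng.choices(string.digits, k=10))
    return TEMPLATE.format(name=name, phone=phone)

print(sample_task(random.Random(0)))  # fresh parameters each seed -> effectively unlimited variants
```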
arXiv Detail & Related papers (2024-05-23T13:48:54Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the 1.4B and 2.7B parameter scale, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
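The read/write/execute action set mentioned above could be exposed to an agent as plain tool functions. The names below are illustrative, not the benchmark's API.

```python
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    Path(path).write_text(content)

def execute(script: str, timeout: int = 600) -> str:
    """Run a candidate script and return combined output for the agent to inspect."""
    result = subprocess.run(["python", script], capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr
```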
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- AutoDroid: LLM-powered Task Automation in Android [32.241570727243534]
We introduce AutoDroid, a mobile task automation system capable of handling arbitrary tasks on any Android application without manual effort.
The main components include a functionality-aware UI representation method that bridges the UI with the LLM.
We evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks.
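A functionality-aware UI representation might, for instance, serialize only the actionable elements of the view hierarchy into a compact indexed list for the LLM prompt. This is a guess at the flavor of such a method, with made-up field names.

```python
def simplify_ui(elements: list) -> str:
    """Keep only interactive elements, indexed so the LLM can refer to them by number."""
    lines = []
    for i, e in enumerate(elements):
        if e.get("clickable") or e.get("editable"):
            label = e.get("text") or e.get("content_desc") or e.get("resource_id", "?")
            kind = "input" if e.get("editable") else "button"
            lines.append(f"[{i}] <{kind}> {label}")
    return "\n".join(lines)

ui = [
    {"clickable": True, "text": "Send"},
    {"editable": True, "content_desc": "Message body"},
    {"clickable": False, "text": "Decorative divider"},  # dropped from the prompt
]
print(simplify_ui(ui))
```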
arXiv Detail & Related papers (2023-08-29T13:02:30Z)
- DroidBot-GPT: GPT-powered UI Automation for Android [11.980924738484994]
DroidBot-GPT is a tool that utilizes GPT-like large language models (LLMs) to automate interactions with Android mobile applications.
Given a natural language description of a desired task, DroidBot-GPT can automatically generate and execute actions that navigate the app to complete the task.
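The generate-and-execute loop can be stated in a few lines. The Env and Model interfaces below are stand-ins, not DroidBot-GPT's actual components.

```python
from typing import Protocol

class Env(Protocol):
    def describe_ui(self) -> str: ...
    def execute(self, action: str) -> str: ...  # perform tap/scroll/type, return new UI state

class Model(Protocol):
    def complete(self, prompt: str) -> str: ...

def run_task(task: str, env: Env, model: Model, max_steps: int = 20) -> bool:
    """Observe -> prompt -> act, until the model declares the task finished."""
    state = env.describe_ui()
    for _ in range(max_steps):
        prompt = f"Task: {task}\nCurrent UI:\n{state}\nNext action (or DONE):"
        action = model.complete(prompt).strip()
        if action == "DONE":
            return True
        state = env.execute(action)
    return False
```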
arXiv Detail & Related papers (2023-04-14T11:31:56Z)
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built through a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z)
- AndroidEnv: A Reinforcement Learning Platform for Android [41.572096255032946]
AndroidEnv is an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem.
It allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface.
Since agents train on a realistic simulation of an Android device, they have the potential to be deployed on real devices.
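Schematically, the universal touchscreen interface reduces every step to a touch or lift event at a normalized screen position. The names below are illustrative; consult the AndroidEnv documentation for the actual dm_env action spec.

```python
import numpy as np

TOUCH, LIFT = 0, 1  # minimal event vocabulary for a touchscreen-only interface

def random_touch_policy(rng: np.random.Generator) -> dict:
    """Sample one raw action: an event type plus an (x, y) position in [0, 1)^2."""
    return {
        "action_type": int(rng.integers(0, 2)),             # TOUCH or LIFT
        "touch_position": rng.random(2, dtype=np.float32),  # normalized coordinates
    }

print(random_touch_policy(np.random.default_rng(0)))
```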
arXiv Detail & Related papers (2021-05-27T15:20:14Z)