AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- URL: http://arxiv.org/abs/2405.14573v3
- Date: Tue, 22 Oct 2024 20:49:38 GMT
- Title: AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
- Authors: Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
- Abstract summary: We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps.
Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language.
Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work.
- Score: 5.044046039265116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at github.com/google-research/android_world.
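The task design described in the abstract (parameterized, natural-language goals backed by initialization, success-checking, and tear-down logic that modifies and inspects device state) can be pictured with a short sketch. The following is a minimal, hypothetical Python example: the class name ContactTask, the env object and its delete_contact/query_contact methods, and the hook names are illustrative assumptions rather than the actual AndroidWorld API; see github.com/google-research/android_world for the real implementation.

```python
import random
from typing import Any


class ContactTask:
    """Hypothetical AndroidWorld-style task: add a contact with random parameters.

    Mirrors the abstract's description: a parameterized natural-language goal
    plus initialization, success-checking, and tear-down hooks that act on
    device state. All environment calls below are placeholders.
    """

    template = "Create a contact named {name} with phone number {number}."

    @classmethod
    def generate_random_params(cls) -> dict[str, Any]:
        # Dynamic task construction: sample fresh parameters for each episode.
        name = random.choice(["Ada Lovelace", "Grace Hopper", "Alan Turing"])
        number = "555-" + "".join(random.choices("0123456789", k=4))
        return {"name": name, "number": number}

    def __init__(self, params: dict[str, Any]):
        self.params = params
        self.goal = self.template.format(**params)  # natural-language instruction

    def initialize_task(self, env) -> None:
        # Put the device into a known state before the agent starts.
        env.delete_contact(self.params["name"])

    def is_successful(self, env) -> float:
        # Reward signal: inspect device state rather than the agent's actions.
        contact = env.query_contact(self.params["name"])
        return 1.0 if contact and contact.number == self.params["number"] else 0.0

    def tear_down(self, env) -> None:
        # Restore the device so later tasks start from a clean state.
        env.delete_contact(self.params["name"])
```

Because success is verified by inspecting device state rather than the agent's action trace, the same template can be re-instantiated with fresh parameters in unlimited variations while remaining automatically checkable.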
Related papers
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
SPA-Bench offers three key contributions: A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines.
A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
- Towards Open-World Mobile Manipulation in Homes: Lessons from the NeurIPS 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge [93.4434417387526]
We propose Open Vocabulary Mobile Manipulation as a key benchmark task for robotics.
We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task.
We detail the results and methodologies used, both in simulation and real-world settings.
arXiv Detail & Related papers (2024-07-09T15:15:01Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface.
The experimental results demonstrate that a single agent using GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents [7.4568642040547894]
Large language model (LLM)-based mobile agents are increasingly popular due to their capability to interact directly with mobile phone Graphical User Interfaces (GUIs).
Despite their promising prospects in both academic and industrial sectors, little research has focused on benchmarking the performance of existing mobile agents.
We propose an efficient and user-friendly benchmark, MobileAgentBench, designed to alleviate the burden of extensive manual testing.
arXiv Detail & Related papers (2024-06-12T13:14:50Z)
- Benchmarking Mobile Device Control Agents across Diverse Configurations [19.01954948183538]
B-MoCA is a benchmark for evaluating and developing mobile device control agents.
We benchmark diverse agents, including agents employing large language models (LLMs) or multi-modal LLMs.
While these agents demonstrate proficiency in executing straightforward tasks, their poor performance on complex tasks highlights significant opportunities for future research to improve effectiveness.
arXiv Detail & Related papers (2024-04-25T14:56:32Z)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
- WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
- HomeRobot: Open-Vocabulary Mobile Manipulation [107.05702777141178]
Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location.
HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch.
arXiv Detail & Related papers (2023-06-20T14:30:32Z)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction [28.53259866617677]
We introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment.
We collect an open-world task set across various real-world apps and a fixed-world task set, WikiHow, which captures a significant amount of dynamic online content.
Our findings reveal that even advanced models struggle with tasks that are relatively simple for humans.
arXiv Detail & Related papers (2023-05-14T12:31:03Z)
- ALAN: Autonomously Exploring Robotic Agents in the Real World [28.65531878636441]
ALAN is an autonomously exploring robotic agent that can perform tasks in the real world with little training and interaction time.
This is enabled by measuring environment change, which reflects object movement while ignoring changes in the robot's own position (a rough sketch of this idea appears after this list).
We evaluate our approach on two different real-world play kitchen settings, enabling a robot to efficiently explore and discover manipulation skills.
arXiv Detail & Related papers (2023-02-13T18:59:09Z)
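The environment-change signal mentioned in the ALAN entry can be illustrated with a rough sketch. The following is an illustrative Python example under simplifying assumptions (pixel-wise RGB differencing with a precomputed robot mask); it is not ALAN's actual formulation, and the function name and threshold are hypothetical.

```python
import numpy as np


def environment_change(obs_before: np.ndarray,
                       obs_after: np.ndarray,
                       robot_mask: np.ndarray,
                       threshold: float = 30.0) -> float:
    """Illustrative environment-change score, not ALAN's exact method.

    obs_before, obs_after: HxWx3 uint8 RGB observations of the scene.
    robot_mask: HxW boolean array, True wherever the robot appears in either frame.
    Returns the fraction of non-robot pixels whose color changed noticeably,
    so object movement is scored while the robot's own motion is ignored.
    """
    diff = np.abs(obs_before.astype(np.float32) - obs_after.astype(np.float32)).mean(axis=-1)
    scene_pixels = diff[~robot_mask]  # drop pixels occupied by the robot
    return float((scene_pixels > threshold).mean())
```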
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.