OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
- URL: http://arxiv.org/abs/2511.20766v1
- Date: Tue, 25 Nov 2025 19:00:22 GMT
- Title: OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
- Authors: Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim
- Abstract summary: Reliability is key to realizing the promise of autonomous UI-Agents. We develop OpenApps, a light-weight open-source ecosystem with six apps. We run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents.
- Score: 49.99934595922838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, and so can only shed light on whether or how often an agent completes a task within that specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect their ability to complete a task. To address this blind spot in measuring agent reliability across app variations, we develop OpenApps, a light-weight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. We run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations: task success rates for many agents fluctuate by more than 50% across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from 63% to just 4% across app versions. We also find that agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/
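To make the evaluation protocol concrete, here is a minimal, self-contained Python sketch of the kind of variation sweep the abstract describes: generate a grid of app configurations, run an agent on a fixed task set under each one, and report the spread of success rates rather than a single average. All names here (AppConfig, run_task, the configuration fields) are hypothetical stand-ins, not the OpenApps API; see the linked repository for the real interface.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class AppConfig:
    """Hypothetical app variation: visual theme, UI scale, and content seed."""
    theme: str
    font_scale: float
    content_seed: int

def run_task(agent_name: str, config: AppConfig, task_id: int) -> bool:
    """Stand-in for one agent rollout against one app variation.
    A real harness would launch the app with `config` and drive the agent;
    here a deterministically seeded RNG fakes a success/failure outcome."""
    rng = random.Random(f"{agent_name}|{config}|{task_id}")
    return rng.random() < 0.5

# Grid of app variations (appearance and content both vary).
configs = [
    AppConfig(theme, scale, seed)
    for theme, scale, seed in itertools.product(
        ["light", "dark"], [0.8, 1.0, 1.2], range(3)
    )
]

tasks = range(20)
rates = {
    cfg: sum(run_task("demo-agent", cfg, t) for t in tasks) / len(tasks)
    for cfg in configs
}

# Reliability across variations: report the spread, not just the mean.
print(f"mean success: {sum(rates.values()) / len(rates):.2f}")
print(f"worst/best variation: {min(rates.values()):.2f} / {max(rates.values()):.2f}")
```

The last two lines are the point of the design: a single mean can hide a 50-point gap between the best and worst app variation, which is exactly the fluctuation the paper reports.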
Related papers
- AppSelectBench: Application-Level Tool Selection Benchmark [57.03660843195562]
AppSelectBench is a benchmark for evaluating application selection in computer-using agents (CUAs). It contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, yielding more than one hundred thousand such tasks.
arXiv Detail & Related papers (2025-11-25T06:06:17Z)
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution [86.4588675093384]
Toolathlon is a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. It includes 108 manually sourced or crafted tasks, each requiring interaction with multiple Apps over roughly 20 turns on average to complete. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
arXiv Detail & Related papers (2025-10-29T17:32:49Z)
- Coding Agents with Multimodal Browsing are Generalist Problem Solvers [48.938445118630284]
OpenHands-Versa is a generalist AI agent built with a modest number of general tools. We show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains.
arXiv Detail & Related papers (2025-06-03T15:50:55Z)
- Building reliable sim driving agents by scaling self-play [3.3378669626639423]
Simulation agents are essential for designing and testing systems that interact with humans, such as autonomous vehicles (AVs). We propose scaling self-play to thousands of scenarios on the Open Motion dataset under semi-realistic limits on human perception and control. The resulting agents generalize to unseen test scenes, achieving a 99.8% goal completion rate with less than 0.8% combined collision and off-road incidents.
arXiv Detail & Related papers (2025-02-20T16:30:45Z)
- CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation [70.3224918173672]
CowPilot is a framework supporting autonomous as well as human-agent collaborative web navigation. It reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. CowPilot can serve as a useful tool for data collection and agent evaluation across websites.
arXiv Detail & Related papers (2025-01-28T00:56:53Z)
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents [44.16450035387395]
AppWorld is a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs.
The AppWorld Benchmark (40K lines of code) is a suite of 750 natural, diverse, and challenging autonomous agent tasks.
arXiv Detail & Related papers (2024-07-26T17:55:45Z)
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains [43.43344028212623]
τ-bench is a benchmark emulating dynamic conversations between a user and a language agent. It employs an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state; a minimal sketch of this style of check appears after this list.
arXiv Detail & Related papers (2024-06-17T19:33:08Z)
- AppAgent: Multimodal Agents as Smartphone Users [23.318925173980446]
Our framework enables the agent to operate smartphone applications through a simplified action space.
The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations.
To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications.
arXiv Detail & Related papers (2023-12-21T11:52:45Z)
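The database-state comparison that τ-bench describes lends itself to a compact check: treat the annotated goal as a partial specification and verify that every constrained field holds in the final database, ignoring fields the goal leaves free. The Python sketch below is an illustration under that assumption; the schema and field names are invented for this example and are not τ-bench's actual data model.

```python
from typing import Any

def state_matches_goal(final_state: dict[str, Any], goal_state: dict[str, Any]) -> bool:
    """Check whether every annotated goal field is satisfied in the final
    database state. Fields absent from the goal are ignored, so the agent
    is free to touch unrelated records. Nested dicts are compared recursively."""
    for key, goal_value in goal_state.items():
        if key not in final_state:
            return False
        if isinstance(goal_value, dict) and isinstance(final_state[key], dict):
            if not state_matches_goal(final_state[key], goal_value):
                return False
        elif final_state[key] != goal_value:
            return False
    return True

# Hypothetical example: an agent was asked to rebook a flight.
final_db = {"booking_42": {"flight": "UA100", "seat": "14C", "status": "confirmed"}}
goal_db = {"booking_42": {"flight": "UA100", "status": "confirmed"}}
assert state_matches_goal(final_db, goal_db)  # seat is unconstrained, so this passes
```

Checking only goal-annotated fields keeps the evaluation execution-based and faithful: success is defined by the end state of the world, not by the surface form of the conversation.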
This list is automatically generated from the titles and abstracts of the papers on this site.