Related papers: OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

URL: http://arxiv.org/abs/2505.03570v1
Date: Tue, 06 May 2025 14:29:47 GMT
Title: OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Authors: Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan,
Abstract summary: OSUniverse is a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents.<n>We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent.<n>The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.

Related papers

AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow.<n>We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks.<n>We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z)
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments inspired by problems from real world.<n>We show that frontier models and agents score less than 65% on the benchmark.<n>We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems [0.402058998065435]
xOffense is an AI-driven, multi-agent penetration testing framework.<n>It shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable scaling seamlessly with computational infrastructure.
arXiv Detail & Related papers (2025-09-16T12:45:45Z)
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design.<n>Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms.<n>We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z)
What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity.<n>We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities.<n>Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science [70.1638335489284]
High-level machine learning engineering tasks remain labor-intensive and iterative.<n>We introduce R&D-Agent, a comprehensive, decoupled, and framework that formalizes the machine learning process.<n>R&D-Agent defines the MLE into two phases and six components, turning agent design for MLE into a principled, testable process.
arXiv Detail & Related papers (2025-05-20T06:07:00Z)
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction [35.285466934451904]
This paper introduces textscInfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner.<n>Unlike existing approaches that either build intricate around a single large model or only provide modularity, our agent integrates tool-based and pure vision agents.
arXiv Detail & Related papers (2025-05-16T05:43:27Z)
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent [24.97846085313314]
We propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing.<n>We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection.<n>It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data.
arXiv Detail & Related papers (2024-12-24T13:41:47Z)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale.<n>We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials.<n>Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
Smartphone agents are increasingly important for helping users control devices efficiently.<n>We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
A Preliminary Study on Using Large Language Models in Software Pentesting [2.0551676463612636]
Large language models (LLM) are perceived to offer promising potentials for automating security tasks. We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z)
Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks [2.0315147707806283]
Mystique is an accurate and scalable framework for production AI benchmark generation. Mystique is scalable, due to its lightweight data collection, in terms of overhead runtime and instrumentation effort. We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
arXiv Detail & Related papers (2022-12-16T18:46:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.