Related papers: OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

URL: http://arxiv.org/abs/2506.16042v1
Date: Thu, 19 Jun 2025 05:26:40 GMT
Title: OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Authors: Reyna Abhyankar, Qi Qi, Yiying Zhang,
Abstract summary: We conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI.<n>We find that large model calls for planning and reflection account for the majority of the overall latency.<n>We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task.
Score: 6.726770697869473
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning and reflection account for the majority of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and found that even the highest-scoring agents on OSWorld take 1.4-2.7x more steps than necessary.

Related papers

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.57043903478257]
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations.<n>With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality.<n>This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development.
arXiv Detail & Related papers (2025-08-06T14:33:45Z)
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents [30.253353551910404]
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices.<n>We introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models.<n>Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
arXiv Detail & Related papers (2025-04-01T15:40:27Z)
HCAST: Human-Calibrated Autonomy Software Tasks [1.5287939112540956]
We present HCAST, a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks.<n>We estimate that HCAST tasks take humans between one minute and 8+ hours.<n>We evaluate the success rates of AI agents built on frontier foundation models.
arXiv Detail & Related papers (2025-03-21T17:54:01Z)
STEVE: A Step Verification Pipeline for Computer-use Agent Training [84.24814828303163]
STEVE is a step verification pipeline for computer-use agent training.<n> GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution.<n>Our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory.
arXiv Detail & Related papers (2025-03-16T14:53:43Z)
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We introduce TheAgentCompany, a benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker.<n>We find that the most competitive agent can complete 30% of tasks autonomously.<n>This paints a nuanced picture on task automation with simulating LM agents in a setting a real workplace.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.16046798529319]
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI) Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
arXiv Detail & Related papers (2024-10-10T17:43:51Z)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [87.41051677852231]
We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks. We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and spanning multiple applications.
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate programs. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet" Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement [48.29860831901484]
We introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS) We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks. On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks.
arXiv Detail & Related papers (2024-02-12T07:29:22Z)
WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible. We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains. We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.