OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- URL: http://arxiv.org/abs/2404.07972v2
- Date: Thu, 30 May 2024 08:55:12 GMT
- Title: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
- Abstract summary: We introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents.
OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks.
We create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications.
- Score: 87.41051677852231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.
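To make the task format concrete, the following is a minimal sketch of what a task definition with an initial-state setup configuration and an execution-based evaluator could look like; the field names, the `file_exists` check, and the `vm` handle are illustrative assumptions, not the actual OSWorld schema.

```python
# Hypothetical sketch of an OSWorld-style task: an instruction, commands that
# fix the initial state, and an execution-based check of the final VM state.
# Field names, the "file_exists" check, and the `vm` handle are assumptions.
import json

task = {
    "instruction": "Rename report.txt on the Desktop to report_final.txt",
    "setup": [
        # run inside the VM before the agent starts, to pin the initial state
        {"type": "command", "command": "touch ~/Desktop/report.txt"},
    ],
    "evaluator": {
        # run after the agent finishes; success is read off the resulting
        # file-system state, not off the agent's action trace
        "type": "file_exists",
        "path": "~/Desktop/report_final.txt",
    },
}

def evaluate(vm, evaluator: dict) -> bool:
    """Apply the task's execution-based check to the final machine state."""
    if evaluator["type"] == "file_exists":
        return vm.path_exists(evaluator["path"])  # hypothetical VM handle
    raise ValueError(f"unknown evaluator type: {evaluator['type']}")

print(json.dumps(task, indent=2))
```

The key design point, per the abstract, is that success is judged by executing a check against the resulting machine state, which is what makes the evaluation reliable and reproducible across runs.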
Related papers
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale [22.493676199881794]
Large language models (LLMs) show remarkable potential to act as computer agents.
However, measuring agent performance in realistic environments remains a challenge.
We introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS).
Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% for an unassisted human.
arXiv Detail & Related papers (2024-09-12T17:56:43Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
CRAB is the first benchmark framework designed to support cross-environment tasks.
Our framework supports multiple devices and can be easily extended to any environment with a Python interface; a sketch of what such an interface might look like follows this entry.
Experimental results demonstrate that a single agent with GPT-4o achieves the best completion ratio, at 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
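As a rough illustration of "any environment with a Python interface", the sketch below defines a minimal environment protocol; the class and method names are assumptions for illustration and do not reflect CRAB's actual API.

```python
# Minimal sketch of a cross-environment interface in the spirit of "any
# environment with a Python interface". Names are assumed for illustration
# and are not CRAB's actual API.
from abc import ABC, abstractmethod
from typing import Any

class Environment(ABC):
    @abstractmethod
    def reset(self) -> None:
        """Restore the environment to its initial state."""

    @abstractmethod
    def observe(self) -> dict[str, Any]:
        """Return the current observation, e.g. a screenshot or UI tree."""

    @abstractmethod
    def step(self, action: dict[str, Any]) -> None:
        """Execute one agent action (click, type, shell command, ...)."""

class DesktopEnv(Environment):
    """Toy desktop backend: a new device needs only one such subclass."""

    def reset(self) -> None:
        print("restoring snapshot")

    def observe(self) -> dict[str, Any]:
        return {"screenshot": b"", "a11y_tree": {}}

    def step(self, action: dict[str, Any]) -> None:
        print(f"executing {action}")
```

Under this reading, supporting a new device amounts to writing one subclass that implements the same three methods.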
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments.
AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces; an illustrative action-space sketch follows this entry.
It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos.
Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function calling with efficient auto-evaluation.
arXiv Detail & Related papers (2024-03-26T17:54:15Z)
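"Highly generic observation and action spaces" can be pictured as a single action type that covers both raw GUI control and function calling; the types below are an assumed illustration, not AgentStudio's real interface.

```python
# Assumed illustration of a generic action space that unifies GUI primitives
# with function calling; these types are not taken from AgentStudio itself.
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class ActionKind(Enum):
    CLICK = "click"  # mouse click at pixel coordinates
    TYPE = "type"    # keyboard text entry
    CALL = "call"    # direct function/tool invocation

@dataclass
class Action:
    kind: ActionKind
    args: dict[str, Any] = field(default_factory=dict)

# One action space covers both raw GUI control and function calling:
trajectory = [
    Action(ActionKind.CLICK, {"x": 540, "y": 320}),
    Action(ActionKind.TYPE, {"text": "quarterly report"}),
    Action(ActionKind.CALL, {"name": "open_app", "kwargs": {"app": "gimp"}}),
]
for action in trajectory:
    print(action.kind.value, action.args)
```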
- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web [43.60736044871539]
We introduce OmniACT, the first-of-its-kind dataset and benchmark for assessing an agent's capability to generate programs that automate computer tasks.
The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet".
Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks; an illustrative generated script follows this entry.
arXiv Detail & Related papers (2024-02-27T14:47:53Z)
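As a hedged illustration of the program-generation setting, a script an agent might emit for the task "Play the next song" could be a few lines of GUI automation. The PyAutoGUI calls below are real library calls, but the coordinates and the choice of library are assumptions, not an example drawn from the dataset.

```python
# Assumed illustration of the kind of program an agent might generate for an
# OmniACT-style task like "Play the next song". Coordinates are placeholders;
# this is not an example drawn from the dataset itself.
import pyautogui

# Option 1: click the player's "next track" button at an assumed location.
pyautogui.click(x=1510, y=870)

# Option 2: send the media "next track" key, where the platform supports it.
pyautogui.press("nexttrack")
```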
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement [48.29860831901484]
We introduce OS-Copilot, a framework for building generalist agents capable of interfacing with comprehensive elements of an operating system (OS).
We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a benchmark for general AI assistants, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via skills accumulated from previous tasks; a sketch of this skill-accumulation pattern follows this entry.
arXiv Detail & Related papers (2024-02-12T07:29:22Z)
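The skill-accumulation idea can be pictured as a library of tool functions the agent writes once and reuses on later tasks; the sketch below is an assumed pattern, not OS-Copilot's actual implementation.

```python
# Assumed sketch of skill accumulation: tools written for one task are stored
# and reused on later tasks. Not OS-Copilot's actual implementation.
skills: dict[str, str] = {}  # skill name -> Python source

def add_skill(name: str, source: str) -> None:
    """Store a newly written tool so later tasks can reuse it."""
    skills[name] = source

def run_skill(name: str) -> None:
    """Execute a stored skill in a fresh namespace (trusted-code assumption)."""
    namespace: dict = {}
    exec(skills[name], namespace)

add_skill("greet", "print('skill reused on a new task')")
run_skill("greet")
```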
- WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focused on evaluating the functional correctness of task completions; a sketch of such a state-based check follows this entry.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
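Functional correctness here means judging the final site state rather than the agent's click sequence. The check below is a hedged sketch: the `/api/products` endpoint and the SKU are hypothetical stand-ins, not WebArena's actual evaluation API.

```python
# Hedged sketch of a functional-correctness check: success is judged from the
# site's final state, not from the agent's action trace. The /api/products
# endpoint and the SKU are hypothetical stand-ins, not WebArena's actual API.
import requests

def task_succeeded(base_url: str) -> bool:
    """Did the agent actually create the product it was asked to create?"""
    resp = requests.get(f"{base_url}/api/products", params={"sku": "NEW-SKU-1"})
    resp.raise_for_status()
    return any(p.get("sku") == "NEW-SKU-1" for p in resp.json())
```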