OpenCUA: Open Foundations for Computer-Use Agents
- URL: http://arxiv.org/abs/2508.09123v2
- Date: Thu, 14 Aug 2025 16:47:06 GMT
- Title: OpenCUA: Open Foundations for Computer-Use Agents
- Authors: Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu,
- Abstract summary: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs)<n>As their commercial potential grows, critical details of the most capable CUA systems remain closed.<n>We propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models.
- Score: 71.17624594647768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
Related papers
- FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access [7.480885391518904]
FIRST provides cloud-like access to diverse AI models, like Large Language Models (LLMs) on existing HPC infrastructure.<n>System allows researchers to run parallel inference workloads via an OpenAI-compliant API on private, secure environments.
arXiv Detail & Related papers (2025-10-15T16:28:34Z) - LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science [69.1690891731311]
We propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models.<n>In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents respond based on their capabilities.<n>We evaluate our method on three benchmarks that require explicit data discovery.
arXiv Detail & Related papers (2025-09-30T22:34:23Z) - ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data [119.41354691583899]
ScaleCUA is a step toward scaling open-source computer use data and foundation models.<n>It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts.
arXiv Detail & Related papers (2025-09-18T17:59:22Z) - DPO Learning with LLMs-Judge Signal for Computer Use Agents [9.454381108993832]
Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks.<n>We develop a lightweight vision-language model that runs entirely on local machines.
arXiv Detail & Related papers (2025-06-03T17:27:04Z) - O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering [31.38063794496179]
O$2$-Searcher is a novel search agent leveraging reinforcement learning to tackle both open-ended and closed-ended questions in the open domain.<n>It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies.<n>Extensive experiments show that O$2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$2$-QA.
arXiv Detail & Related papers (2025-05-22T12:17:13Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models [61.14336781917986]
We introduce OpenR, an open-source framework for enhancing the reasoning capabilities of large language models (LLMs)
OpenR unifies data acquisition, reinforcement learning training, and non-autoregressive decoding into a cohesive software platform.
Our work is the first to provide an open-source framework that explores the core techniques of OpenAI's o1 model with reinforcement learning.
arXiv Detail & Related papers (2024-10-12T23:42:16Z) - NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals [58.83169560132308]
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks.
arXiv Detail & Related papers (2024-07-18T17:59:01Z) - OS-Copilot: Towards Generalist Computer Agents with Self-Improvement [48.29860831901484]
We introduce OS-Copilot, a framework to build generalist agents capable of interfacing with comprehensive elements in an operating system (OS)
We use OS-Copilot to create FRIDAY, a self-improving embodied agent for automating general computer tasks.
On GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by 35%, showcasing strong generalization to unseen applications via accumulated skills from previous tasks.
arXiv Detail & Related papers (2024-02-12T07:29:22Z) - Serving Deep Learning Model in Relational Databases [70.53282490832189]
Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains.
We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks.
The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS)
arXiv Detail & Related papers (2023-10-07T06:01:35Z) - Intelligence-Endogenous Management Platform for Computing and Network
Convergence [33.45559800534038]
We present the concept of an intelligence-endogenous management platform for CNCs called emphCNC brain based on artificial intelligence technologies.
It aims at efficiently matching the supply and demand with high heterogeneity in a CNC via four key building blocks, i.e., perception, scheduling, adaptation, and governance.
It is evaluated on a CNC testbed that integrates two open-source and popular frameworks and a real-world business dataset provided by Microsoft Azure.
arXiv Detail & Related papers (2023-08-07T10:12:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.