ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
- URL: http://arxiv.org/abs/2509.15221v2
- Date: Fri, 19 Sep 2025 05:29:03 GMT
- Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
- Authors: Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
- Abstract summary: ScaleCUA is a step toward scaling open-source computer use data and foundation models. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts.
- Score: 119.41354691583899
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
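As an illustration of what "operating GUIs autonomously" means in practice, here is a minimal, hypothetical observation-action loop. The model call and action schema are assumptions made for the sketch, not ScaleCUA's released interface; see the repository above for the actual code.

```python
# Hypothetical sketch of a VLM-driven computer-use loop. The propose_action
# call and the action schema are illustrative assumptions, not ScaleCUA's
# released interface; see the GitHub repository above for the actual code.
import pyautogui  # cross-platform screenshot and input automation


def propose_action(screenshot, instruction):
    """Placeholder for the VLM: map (screenshot, instruction) to an action
    dict such as {"type": "click", "x": 120, "y": 340}."""
    raise NotImplementedError("swap in a released CUA model here")


def run_episode(instruction, max_steps=10):
    for _ in range(max_steps):
        shot = pyautogui.screenshot()              # observe the GUI state
        action = propose_action(shot, instruction)
        if action["type"] == "click":              # execute a grounded click
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":             # type predicted text
            pyautogui.typewrite(action["text"])
        elif action["type"] == "done":             # agent signals completion
            break
```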
Related papers
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents [56.72789202127874]
The paper introduces GUI-Owl-1.5, the latest native GUI agent model. It supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. It achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks.
arXiv Detail & Related papers (2026-02-15T01:52:19Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- OpenCUA: Open Foundations for Computer-Use Agents [71.17624594647768]
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs). As their commercial potential grows, critical details of the most capable CUA systems remain closed. We propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models.
arXiv Detail & Related papers (2025-08-12T17:52:32Z)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z)
- TinyAgent: Function Calling at the Edge [32.174966522801746]
We present an end-to-end framework for training and deploying task-specific small language model agents capable of function calling for driving agentic systems at the edge.
As a driving application, we demonstrate a local Siri-like system for Apple's MacBook that can execute user commands through text or voice input.
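The function-calling pattern the abstract describes can be sketched as follows; the tool name and JSON schema here are hypothetical stand-ins, not TinyAgent's actual interface.

```python
# Minimal sketch of on-device function calling: a small LM emits a JSON tool
# call, which the host parses and dispatches. The tool name and schema are
# hypothetical, not TinyAgent's actual interface.
import json


def open_app(name: str) -> str:
    # Stub standing in for a real macOS automation call.
    return f"opened {name}"


TOOLS = {"open_app": open_app}


def dispatch(model_output: str) -> str:
    call = json.loads(model_output)  # e.g. {"tool": "open_app", "args": {"name": "Safari"}}
    return TOOLS[call["tool"]](**call["args"])


print(dispatch('{"tool": "open_app", "args": {"name": "Safari"}}'))
```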
arXiv Detail & Related papers (2024-09-01T04:23:48Z)
- Reproducible scaling laws for contrastive language-image learning [42.354402731615444]
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks.
We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
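A power law error ≈ a · samples^(−b) appears as a straight line in log-log space, so it can be checked with a simple linear fit; the data points below are invented for illustration and are not LAION/OpenCLIP numbers.

```python
# Check a power law error ≈ a * samples^(-b): it is a straight line in
# log-log space, so fit it with linear regression. The data points below
# are invented for illustration, not LAION/OpenCLIP results.
import numpy as np

samples = np.array([1e8, 4e8, 1.6e9, 6.4e9])  # image-text pairs seen
error = np.array([0.42, 0.35, 0.29, 0.24])    # downstream zero-shot error

slope, log_a = np.polyfit(np.log(samples), np.log(error), 1)
print(f"error ≈ {np.exp(log_a):.3g} * samples^{slope:.3f}")  # slope < 0
```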
arXiv Detail & Related papers (2022-12-14T10:24:50Z)
- Scaling Up Models and Data with t5x and seqio [118.04625413322827]
t5x and seqio are open-source software libraries for building and training language models.
These libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data.
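A minimal task-registration sketch in the seqio style follows; the TFDS dataset name, vocabulary path, and key map are placeholders, and the exact API may differ across versions.

```python
# Minimal seqio task-registration sketch following the library's documented
# pattern. The TFDS name, vocabulary path, and key map are placeholders,
# and details may differ across seqio versions.
import functools
import seqio

VOCAB = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")

seqio.TaskRegistry.add(
    "my_task",
    source=seqio.TfdsDataSource(tfds_name="my_dataset:1.0.0"),
    preprocessors=[
        functools.partial(seqio.preprocessors.rekey,
                          key_map={"inputs": "source", "targets": "target"}),
        seqio.preprocessors.tokenize,   # map text to token IDs with VOCAB
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=VOCAB),
        "targets": seqio.Feature(vocabulary=VOCAB),
    },
)

# Tasks are pulled by name at training time, e.g. inside a t5x run:
ds = seqio.get_mixture_or_task("my_task").get_dataset(
    sequence_length={"inputs": 512, "targets": 512}, split="train")
```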
arXiv Detail & Related papers (2022-03-31T17:12:13Z)