Related papers: UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

URL: http://arxiv.org/abs/2510.17790v1
Date: Mon, 20 Oct 2025 17:48:26 GMT
Title: UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan,
Abstract summary: We present UltraCUA, a foundation model that bridges the gap between GUI primitives and high-level programmatic tool calls.<n>Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents.
Score: 77.63125913907771
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.

Related papers

MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present textbfMagicAgent, a series of foundation models specifically designed for generalized agent planning.<n>We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks.<n>We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents [23.583947864141162]
EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers.<n>Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training.<n>Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
arXiv Detail & Related papers (2026-01-30T06:01:23Z)
Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning [42.1534425503333]
CrossAgent is a unified agentic model that masters heterogeneous action spaces and autonomously selects the most effective interface for each step of a trajectory.<n>Experiments on over 800 tasks in the open-world Minecraft environment demonstrate that CrossAgent achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-10T14:52:29Z)
Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering [11.375577889547351]
We introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation.<n>The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation.<n>Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets.
arXiv Detail & Related papers (2025-11-22T23:12:56Z)
Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL)<n>We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner.<n>On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z)
Reinforcement Learning for Machine Learning Engineering Agents [52.03168614623642]
We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger, but static models.<n>We propose duration- aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions.<n>We also propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early.
arXiv Detail & Related papers (2025-09-01T18:04:10Z)
AWorld: Orchestrating the Training Recipe for Agentic AI [35.94278765364194]
We introduce AWorld, an open-source system engineered for large-scale agent-environment interaction.<n>By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution.<n>We trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set.
arXiv Detail & Related papers (2025-08-28T04:04:30Z)
CoAct-1: Computer-using Agents with Coding as Actions [94.99657662893338]
CoAct-1 is a novel multi-agent system that combines GUI-based control with direct programmatic execution.<n>We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%.
arXiv Detail & Related papers (2025-08-05T21:33:36Z)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents.<n>Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions.<n>We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning [17.437573206368494]
Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks.<n>We present MENTOR, a method that improves both the architecture and optimization of RL agents.<n>MenTOR outperforms state-of-the-art methods across three simulation benchmarks and achieves an average of 83% success rate on three challenging real-world robotic manipulation tasks.
arXiv Detail & Related papers (2024-10-19T04:31:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.