Related papers: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

URL: http://arxiv.org/abs/2512.10962v1
Date: Sat, 22 Nov 2025 23:12:56 GMT
Title: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering
Authors: Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, Xia Song
Abstract summary: We introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation. Our results establish step-level filtering as a key principle for scalable CUA training, and we construct two new datasets.
Score: 11.375577889547351
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions making up a large proportion of the steps, which renders naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI's computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses the SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training, and we contribute two new datasets (WebSTAR, WebSCORE) and a lightweight reward model (StepRM) as practical tools to advance robust and efficient CUAs.
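The step-level filtering idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Step` and `GradedStep` types, the `filter_steps` function, the grader signature, the 0.5 threshold, and the toy grader are all hypothetical names and values chosen for illustration. In the paper the grading is done by a model (o4-mini or the distilled StepRM); here a trivial stand-in function plays that role, and reasoning augmentation is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    observation: str  # hypothetical: a screenshot path or page description
    action: str       # hypothetical: an action string such as "click(search)"


@dataclass
class GradedStep:
    step: Step
    score: float      # grader's score for this single step


# Hypothetical grader interface: (task, previous steps, current step) -> score in [0, 1].
Grader = Callable[[str, List[Step], Step], float]


def filter_steps(task: str, trajectory: List[Step], grader: Grader,
                 threshold: float = 0.5) -> List[GradedStep]:
    """Grade each step on its own and keep only steps scoring at or above threshold."""
    kept: List[GradedStep] = []
    for i, step in enumerate(trajectory):
        # The grader sees the task and everything that happened before this step.
        score = grader(task, trajectory[:i], step)
        if score >= threshold:
            kept.append(GradedStep(step=step, score=score))
    return kept


def toy_grader(task: str, history: List[Step], step: Step) -> float:
    """Stand-in for a model-based grader: treats no-op actions as incorrect."""
    return 0.0 if step.action.startswith("noop") else 1.0


if __name__ == "__main__":
    rollout = [
        Step("home page", "click(search)"),
        Step("home page", "noop()"),
        Step("results page", "type('laptop')"),
    ]
    for graded in filter_steps("find a laptop", rollout, toy_grader):
        print(graded.step.action, graded.score)
```

The point of the sketch is the granularity: a noisy rollout is not discarded as a whole, and the individual steps judged correct are retained as supervision.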

Related papers

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents [23.583947864141162]
EigenData is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training. Our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
arXiv Detail & Related papers (2026-01-30T06:01:23Z)
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action [77.63125913907771]
We present UltraCUA, a foundation model that bridges the gap between GUI primitives and high-level programmatic tool calls. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents.
arXiv Detail & Related papers (2025-10-20T17:48:26Z)
GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments [1.6398143439811486]
Training robust world models requires large-scale, precisely labeled multimodal datasets. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision.
arXiv Detail & Related papers (2025-10-07T21:13:03Z)
SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning [29.14330314090061]
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations. Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity. We introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies.
arXiv Detail & Related papers (2025-05-28T17:45:05Z)
Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
STEVE: A Step Verification Pipeline for Computer-use Agent Training [84.24814828303163]
STEVE is a step verification pipeline for computer-use agent training. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution. Our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory.
arXiv Detail & Related papers (2025-03-16T14:53:43Z)
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
Reinforcement Learning as a Parsimonious Alternative to Prediction Cascades: A Case Study on Image Segmentation [6.576180048533476]
PaSeR (Parsimonious with Reinforcement Learning) is a non-cascading, cost-aware learning pipeline. We show that PaSeR achieves better accuracy while minimizing computational cost relative to cascaded models. We introduce a new metric IoU/GigaFlop to evaluate the balance between cost and performance.
arXiv Detail & Related papers (2024-02-19T01:17:52Z)
Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only one single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z)
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments. We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.