VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
- URL: http://arxiv.org/abs/2510.19488v1
- Date: Wed, 22 Oct 2025 11:25:48 GMT
- Title: VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
- Authors: Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu,
- Abstract summary: VideoAgentTrek is a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale.<n>Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels.<n>applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically.
- Score: 62.29924199978745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
Related papers
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI [26.33451769892426]
We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks.<n>Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation.
arXiv Detail & Related papers (2025-10-07T08:40:33Z) - Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention [0.0]
We introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task.<n>Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP, and processes the resulting feature representations with a multihead attention model.<n>The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference.
arXiv Detail & Related papers (2025-04-13T22:08:18Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale.<n>We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials.<n>Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z) - It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinects-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.