VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
- URL: http://arxiv.org/abs/2602.10102v1
- Date: Tue, 10 Feb 2026 18:58:19 GMT
- Title: VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
- Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
- Abstract summary: VideoWorld 2 offers the first investigation into learning transferable knowledge directly from raw real-world videos. VideoWorld 2 learns latent codes that focus on compact and meaningful task-related dynamics. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
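To make the decoupling in the dLDM concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: a small encoder compresses frame-to-frame change into a compact latent dynamics code, appearance synthesis is delegated to a pretrained video diffusion model (shown only as a stub), and an autoregressive transformer models code sequences for policy learning and long-horizon rollout. All class names, dimensions, and the diffusion interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LatentDynamicsEncoder(nn.Module):
    """Compresses adjacent-frame features into a compact dynamics code.

    Hypothetical sketch: names and sizes are assumptions for illustration,
    not the paper's actual dLDM.
    """

    def __init__(self, frame_dim: int = 512, code_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256),
            nn.ReLU(),
            nn.Linear(256, code_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        # The code is kept low-dimensional so it is forced to capture
        # task-related motion rather than visual appearance.
        return self.net(torch.cat([feat_t, feat_t1], dim=-1))


def render_next_frame(prev_frame: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    """Stub for the pretrained video diffusion model that, per the abstract,
    handles appearance: it would synthesize the next frame conditioned on the
    previous frame and the dynamics code. Not implemented here."""
    raise NotImplementedError


class AutoregressiveDynamicsModel(nn.Module):
    """Causal transformer over latent code sequences (the policy /
    long-horizon reasoning stage). Hyperparameters are illustrative."""

    def __init__(self, code_dim: int = 32, model_dim: int = 128, n_layers: int = 2):
        super().__init__()
        self.proj_in = nn.Linear(code_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=4, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(model_dim, code_dim)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, T, code_dim); a causal mask limits attention to the past.
        T = codes.size(1)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=codes.device), diagonal=1
        )
        h = self.backbone(self.proj_in(codes), mask=mask)
        return self.proj_out(h)  # prediction of the next code at each step


if __name__ == "__main__":
    enc = LatentDynamicsEncoder()
    ar = AutoregressiveDynamicsModel()
    feats = torch.randn(4, 9, 512)  # per-frame features: 8 transitions per clip
    codes = torch.stack(
        [enc(feats[:, t], feats[:, t + 1]) for t in range(8)], dim=1
    )  # (4, 8, 32) dynamics codes
    pred = ar(codes)  # next-code predictions
    loss = nn.functional.mse_loss(pred[:, :-1], codes[:, 1:])
    print(loss.item())
```

The design point the sketch tries to mirror is that appearance is handled by a frozen, pretrained generator, so the only thing the encoder can usefully put into its low-dimensional code is the task-relevant dynamics; the autoregressive model then operates purely in that compact code space.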
Related papers
- Learning Skills from Action-Free Videos [20.442392109789772]
We introduce Skill Abstraction from Optical Flow (SOF), a framework that learns latent skills from action-free videos. SOF enables high-level planning over video-derived skills and allows for easier translation of these skills into actions. Experiments show that our approach consistently improves performance in both multitask and long-horizon settings.
arXiv Detail & Related papers (2025-12-23T05:03:33Z)
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed, long in duration, and contain far more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z)
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate what capabilities emerge when world model priors are transferred into Vision-Language Models. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- Vid2World: Crafting Video Diffusion Models to Interactive World Models [35.42362065437052]
We present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. Our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.
arXiv Detail & Related papers (2025-05-20T13:41:45Z)
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input. We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - InternVideo: General Video Foundation Models via Generative and
Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets spanning tasks such as video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)