Olaf-World: Orienting Latent Actions for Video World Modeling
- URL: http://arxiv.org/abs/2602.10104v1
- Date: Tue, 10 Feb 2026 18:58:41 GMT
- Title: Olaf-World: Orienting Latent Actions for Video World Modeling
- Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
- Abstract summary: Scaling action-controllable world models is limited by the scarcity of action labels. We present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video.
- Score: 100.96069208914957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
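To make the control-effect alignment idea concrete, here is a minimal, illustrative sketch of a sequence-level alignment loss in the spirit of Seq$\Delta$-REPA. All names (`seq_delta_repa_loss`, the learned projection `proj`) and the choice of cosine similarity are assumptions for illustration, not the authors' implementation: latent actions inferred for a clip are integrated over time and anchored to the feature difference a frozen, self-supervised video encoder observes between the clip's endpoints.

```python
# Illustrative sketch (not the paper's code): a sequence-level
# control-effect alignment loss in the spirit of Seq-Delta-REPA.
import torch
import torch.nn as nn
import torch.nn.functional as F


def seq_delta_repa_loss(latent_actions, frozen_feats, proj):
    """
    latent_actions: (B, T, D_a) latent actions inferred for each transition.
    frozen_feats:   (B, T+1, D_f) per-frame features from a frozen,
                    self-supervised video encoder (no gradients flow into it).
    proj:           learned module mapping D_a -> D_f (an assumption here).
    """
    # Integrate the per-step latent actions over the whole sequence.
    integrated = latent_actions.sum(dim=1)                        # (B, D_a)
    # Observed control effect: encoder feature difference between endpoints.
    delta = (frozen_feats[:, -1] - frozen_feats[:, 0]).detach()   # (B, D_f)
    # Anchor the projected integrated action to the observed effect;
    # negative cosine similarity is one plausible alignment choice.
    pred = proj(integrated)                                       # (B, D_f)
    return -F.cosine_similarity(pred, delta, dim=-1).mean()


# Example usage with random tensors and a linear projection head.
if __name__ == "__main__":
    B, T, D_a, D_f = 4, 8, 32, 768
    proj = nn.Linear(D_a, D_f)
    loss = seq_delta_repa_loss(torch.randn(B, T, D_a),
                               torch.randn(B, T + 1, D_f),
                               proj)
    print(loss.item())
```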
Related papers
- Learning Latent Action World Models In The Wild [50.453458324163705]
We study the problem of learning latent action world models on in-the-wild videos. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space.
arXiv Detail & Related papers (2026-01-08T18:55:39Z) - CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos [73.51386721543135]
We propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. CLAP maps video transitions onto a quantized, physically executable codebook. We introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. A minimal illustrative sketch of this codebook-alignment idea appears after the list below.
arXiv Detail & Related papers (2026-01-07T16:26:33Z) - Latent Action World Models for Control with Unlabeled Trajectories [8.965084673299858]
We study world models that learn from heterogeneous data. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data.
arXiv Detail & Related papers (2025-12-10T19:09:45Z) - Astra: General Interactive World Model with Autoregressive Denoising [73.6594791733982]
Astra is an interactive general world model that generates real-world futures for diverse scenarios. We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions.
arXiv Detail & Related papers (2025-12-09T18:59:57Z) - Precise Action-to-Video Generation Through Visual Action Prompts [62.951609704196485]
Action-driven video generation faces a precision-generality trade-off. Agent-centric action signals provide precision at the cost of cross-domain transferability. We "render" actions into precise visual prompts as domain-agnostic representations.
arXiv Detail & Related papers (2025-08-18T17:12:28Z) - FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios [49.09128364751743]
Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure. We propose FlexiAct, which transfers actions from a reference video to an arbitrary target image.
arXiv Detail & Related papers (2025-05-06T17:58:02Z) - Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in videos.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z) - Weakly-Supervised Temporal Action Detection for Fine-Grained Videos with Hierarchical Atomic Actions [13.665489987620724]
We tackle the problem of weakly-supervised fine-grained temporal action detection in videos for the first time.
We propose to model actions as combinations of reusable atomic actions that are automatically discovered from data.
Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level and coarse action class level, with supervision at each level.
arXiv Detail & Related papers (2022-07-24T20:32:24Z)
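As flagged in the CLAP entry above, here is a minimal, illustrative sketch of how video-transition embeddings might be mapped onto a quantized action codebook and contrastively aligned with proprioceptive embeddings. The class and function names, the codebook size, and the InfoNCE formulation are assumptions for illustration, not the CLAP authors' implementation.

```python
# Illustrative sketch (not the CLAP authors' code): quantize video-transition
# embeddings onto a shared codebook and align them contrastively with
# proprioceptive embeddings from robot trajectories.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionCodebook(nn.Module):
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))

    def quantize(self, z):
        # Nearest-neighbour code assignment with a straight-through estimator.
        dists = torch.cdist(z, self.codes)          # (B, K)
        idx = dists.argmin(dim=-1)                  # (B,)
        q = self.codes[idx]                         # (B, D)
        return z + (q - z).detach(), idx


def contrastive_alignment_loss(video_z, proprio_z, codebook, temperature=0.07):
    # Quantize the video-transition embeddings onto the shared codebook.
    vq, _ = codebook.quantize(video_z)
    v = F.normalize(vq, dim=-1)
    p = F.normalize(proprio_z, dim=-1)
    # InfoNCE: the matching video/proprioceptive pair in each batch row
    # is the positive; all other rows serve as negatives.
    logits = v @ p.t() / temperature                # (B, B)
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)
```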