Walk through Paintings: Egocentric World Models from Internet Priors
- URL: http://arxiv.org/abs/2601.15284v1
- Date: Wed, 21 Jan 2026 18:59:32 GMT
- Title: Walk through Paintings: Egocentric World Models from Internet Priors
- Authors: Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert
- Abstract summary: We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids.
- Score: 65.30611174953958
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
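The abstract describes injecting motor commands into a frozen, pretrained video diffusion backbone through lightweight conditioning layers. The sketch below illustrates one plausible form such a layer could take; the FiLM-style scale/shift modulation, the module name, and the dimensionalities are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a lightweight action-conditioning layer, assuming a
# FiLM-style design; NOT the paper's confirmed implementation.
import torch
import torch.nn as nn


class ActionConditioner(nn.Module):
    """Maps an action vector (e.g., a 3-DoF velocity command or 25-DoF joint
    angles) to per-channel scale/shift applied to a diffusion block's features."""

    def __init__(self, action_dim: int, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),  # produces scale and shift
        )

    def forward(self, features: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # features: (batch, tokens, hidden_dim) from a frozen diffusion block
        # action:   (batch, action_dim) motor command for the predicted frames
        scale, shift = self.mlp(action).chunk(2, dim=-1)
        return features * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Usage sketch: only the conditioner (plus any small adapter layers) would be
# trained, leaving the backbone's Internet-scale video prior intact.
cond = ActionConditioner(action_dim=3, hidden_dim=1024)  # e.g., 3-DoF mobile robot
feats = torch.randn(2, 256, 1024)                        # dummy block features
act = torch.randn(2, 3)                                  # dummy action command
out = cond(feats, act)
```

A design like this keeps the added parameter count small relative to the backbone, which is consistent with the abstract's claim that only modest fine-tuning is required.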
Related papers
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z) - MAD: Motion Appearance Decoupling for efficient Driving World Models [94.40548866741791]
We propose an efficient adaptation framework that converts generalist video models into controllable driving world models. The key idea is to decouple motion learning from appearance synthesis. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors.
arXiv Detail & Related papers (2026-01-14T12:52:23Z) - EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos [25.047225764745978]
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. In experiments, we show that our method achieves state-of-the-art performance in W-HOI reconstruction.
arXiv Detail & Related papers (2026-01-03T03:08:48Z) - Dexterous World Models [24.21588354488453]
Dexterous World Model (DWM) is a scene-action-conditioned video diffusion framework. We show how DWM generates temporally coherent videos depicting plausible human-scene interactions. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects.
arXiv Detail & Related papers (2025-12-19T18:59:51Z) - Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos [24.111891848073288]
Embodied world models aim to predict and interact with the physical world through visual observations and actions. MTV-World introduces Multi-view Trajectory-Video control for precise visuomotor prediction. MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.
arXiv Detail & Related papers (2025-11-17T02:17:04Z) - Learning Primitive Embodied World Models: Towards Scalable Robotic Learning [50.32986780156215]
We propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach enables fine-grained alignment between linguistic concepts and visual representations of robotic actions. Our framework bridges the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
arXiv Detail & Related papers (2025-08-28T14:31:48Z) - Whole-Body Conditioned Egocentric Video Prediction [98.94980209293776]
We train models to Predict Ego-centric Video from human Actions (PEVA). By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
arXiv Detail & Related papers (2025-06-26T17:59:59Z) - PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning [38.004463823796286]
We formulate the motor system of an interactive avatar as a generative motion model. Inspired by recent advances in foundation models, we propose PRIMAL. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
arXiv Detail & Related papers (2025-03-21T21:27:57Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS applies readily to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.