MAD: Motion Appearance Decoupling for efficient Driving World Models
- URL: http://arxiv.org/abs/2601.09452v1
- Date: Wed, 14 Jan 2026 12:52:23 GMT
- Title: MAD: Motion Appearance Decoupling for efficient Driving World Models
- Authors: Ahmad Rahimi, Valentin Gerard, Eloi Zablocki, Matthieu Cord, Alexandre Alahi
- Abstract summary: We propose an efficient adaptation framework that converts generalist video models into controllable driving world models. The key idea is to decouple motion learning from appearance synthesis. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors.
- Score: 94.40548866741791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors, and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita-epfl.github.io/MAD-World-Model/
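The abstract's two-stage recipe, first adapting the backbone to predict skeletonized motion videos and then reusing the same backbone to render RGB frames conditioned on that motion, can be pictured with a minimal training-loop sketch. Everything below is an illustrative assumption: the toy denoiser, the conditioning by channel concatenation, and the latent shapes stand in for the actual SVD/LTX backbone and are not the authors' implementation.

```python
# Minimal sketch of the decoupled motion/appearance adaptation described in the abstract.
# All module names, shapes, and the toy denoiser are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoDenoiser(nn.Module):
    """Stand-in for a pretrained video diffusion backbone (e.g. SVD/LTX latents)."""

    def __init__(self, channels: int = 8, cond_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels + cond_channels, 32, 3, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_latents, cond_latents):
        # Conditioning by channel concatenation; a real backbone would use
        # cross-attention or dedicated conditioning layers instead.
        return self.net(torch.cat([noisy_latents, cond_latents], dim=1))


def toy_denoising_loss(model, clean, cond):
    """Toy noise-prediction objective on (B, C, T, H, W) latents,
    standing in for the real diffusion training loss."""
    noise = torch.randn_like(clean)
    t = torch.rand(clean.shape[0], 1, 1, 1, 1)   # toy continuous noise level
    noisy = (1 - t) * clean + t * noise
    pred = model(noisy, cond)
    return F.mse_loss(pred, noise)


backbone = TinyVideoDenoiser()

# Stage 1: adapt the backbone to *motion* only - videos of skeletonized agents and
# scene elements, conditioned on past context - so learning focuses on dynamics.
motion_latents = torch.randn(2, 8, 4, 16, 16)    # toy skeleton-video latents
context_latents = torch.randn(2, 8, 4, 16, 16)   # toy past-frame context
stage1_loss = toy_denoising_loss(backbone, motion_latents, context_latents)

# Stage 2: reuse the same backbone to render RGB conditioned on the motion video,
# i.e. "dress" the predicted dynamics with texture and lighting.
rgb_latents = torch.randn(2, 8, 4, 16, 16)       # toy RGB-video latents
stage2_loss = toy_denoising_loss(backbone, rgb_latents, motion_latents)

print(float(stage1_loss), float(stage2_loss))
```

In the paper's framing, the two stages share one backbone; only the prediction target (motion video vs. RGB video) and the conditioning signal change between them.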
Related papers
- Walk through Paintings: Egocentric World Models from Internet Priors [65.30611174953958]
We present the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids.
arXiv Detail & Related papers (2026-01-21T18:59:32Z)
- MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models [50.162882483045045]
We propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos.
arXiv Detail & Related papers (2025-10-21T19:05:23Z)
- LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model [22.92353994818742]
Driving world models simulate future scenes via video generation conditioned on the current state and actions. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. We propose several solutions to build a simple yet effective long-term driving world model.
arXiv Detail & Related papers (2025-06-02T11:19:23Z)
- VaViM and VaVAM: Autonomous Driving through Video Generative Modeling [88.33638585518226]
We introduce an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving.
arXiv Detail & Related papers (2025-02-21T18:56:02Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior in video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
- Motion Dreamer: Boundary Conditional Motion Reasoning for Physically Coherent Video Generation [27.690736225683825]
We introduce Motion Dreamer, a two-stage framework that explicitly separates motion reasoning from visual synthesis. Our approach introduces instance flow, a sparse-to-dense motion representation enabling effective integration of partial user-defined motions. Experiments demonstrate that Motion Dreamer significantly outperforms existing methods, achieving superior motion plausibility and visual realism.
arXiv Detail & Related papers (2024-11-30T17:40:49Z)
- AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos (a rough sketch of this masked-blending idea appears after this list).
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z)
- DyNCA: Real-time Dynamic Texture Synthesis Using Neural Cellular Automata [12.05119084381406]
We propose Dynamic Neural Cellular Automata (DyNCA), a framework for real-time and controllable dynamic texture synthesis.
Our method is built upon the recently introduced NCA models and can synthesize infinitely long and arbitrary-sized realistic video textures in real time.
Our model offers several real-time video controls including motion speed, motion direction, and an editing brush tool.
arXiv Detail & Related papers (2022-11-21T13:01:52Z)
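As noted in the AVID entry above, the adaptation mechanism there is a learned mask that decides, per location, how much of the frozen pretrained model's output to override with an action-conditioned correction. The sketch below is a rough, hedged illustration of that blending idea under assumed shapes and an assumed action-injection scheme; it is not the AVID implementation.

```python
# Rough sketch of masked adaptation of a frozen video diffusion model: an adapter
# proposes an action-conditioned correction, and a learned per-pixel mask blends it
# with the pretrained output. Module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class MaskedAdapter(nn.Module):
    def __init__(self, channels: int = 8, action_dim: int = 4):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, channels)
        self.body = nn.Conv3d(2 * channels, 2 * channels, 3, padding=1)

    def forward(self, pretrained_out, noisy_latents, action):
        # Broadcast the action embedding over time and space (assumed injection scheme).
        a = self.action_proj(action)[:, :, None, None, None]
        h = self.body(torch.cat([pretrained_out, noisy_latents + a], dim=1))
        delta, mask_logits = h.chunk(2, dim=1)
        mask = torch.sigmoid(mask_logits)
        # Learned per-location blend of adapter correction and frozen model output.
        return mask * delta + (1.0 - mask) * pretrained_out


adapter = MaskedAdapter()
pretrained_out = torch.randn(2, 8, 4, 16, 16)   # frozen model's denoising output (toy)
noisy_latents = torch.randn(2, 8, 4, 16, 16)
action = torch.randn(2, 4)
print(adapter(pretrained_out, noisy_latents, action).shape)
```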