LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model
- URL: http://arxiv.org/abs/2506.01546v1
- Date: Mon, 02 Jun 2025 11:19:23 GMT
- Title: LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model
- Authors: Xiaodong Wang, Zhirong Wu, Peixi Peng,
- Abstract summary: Driving world models simulate possible futures via video generation conditioned on the current state and actions. Recent studies adopt the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. We propose several solutions for building a simple yet effective long-term driving world model.
- Score: 22.92353994818742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driving world models simulate possible futures via video generation conditioned on the current state and actions. However, current models often suffer from serious error accumulation when predicting the long-term future, which limits their practical application. Recent studies adopt the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are typically trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and plausible long videos due to the training-inference gap. To this end, we propose several solutions for building a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large-motion learning and bidirectional continuous-motion learning. Then, exploiting the continuity of driving scenes, we propose a simple distillation method in which fine-grained video flows serve as self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term, temporally coherent videos. On the public nuScenes benchmark, compared with the state-of-the-art front-view model, our model improves FVD by $27\%$ and reduces inference time by $85\%$ for generating videos of 110+ frames. More videos (including 90s duration) are available at https://Wang-Xiaodong1899.github.io/longdwm/.
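The abstract describes two concrete mechanisms: a cross-granularity distillation loss in which fine-grained (high-fps) latent flows act as self-supervised targets for coarse-grained (low-fps) flows, and a coarse/fine coordination at inference, where the coarse module sketches large motion and the fine module fills in continuous motion between keyframes. Below is a minimal PyTorch-style sketch of one possible reading of both ideas; the module interfaces (`coarse_model`, `fine_model`), the definition of flow as frame-to-frame latent differences, the stride alignment, and the MSE objective are assumptions for illustration, not the authors' released implementation.

```python
# Sketch of cross-granularity flow distillation and coarse/fine coordination.
# All interfaces here are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def latent_flow(latents: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame differences of video latents: (B, T, C, H, W) -> (B, T-1, C, H, W)."""
    return latents[:, 1:] - latents[:, :-1]


def flow_distillation_loss(coarse_latents: torch.Tensor,
                           fine_latents: torch.Tensor,
                           stride: int = 4) -> torch.Tensor:
    """Use fine-grained (high-fps) flows as self-supervised targets for
    coarse-grained (low-fps) flows.

    Assumes the fine clip covers the coarse clip at `stride` times its frame
    rate, so summing `stride` consecutive fine flows telescopes to one coarse
    step. The fine branch is detached so it only acts as a teacher signal.
    """
    fine_flow = latent_flow(fine_latents).detach()
    b, t, c, h, w = fine_flow.shape
    fine_flow = fine_flow[:, : (t // stride) * stride]
    fine_flow = fine_flow.reshape(b, -1, stride, c, h, w).sum(dim=2)
    coarse_flow = latent_flow(coarse_latents)
    return F.mse_loss(coarse_flow, fine_flow)


@torch.no_grad()
def generate_long_video(coarse_model: nn.Module,
                        fine_model: nn.Module,
                        context: torch.Tensor,
                        num_keyframes: int) -> torch.Tensor:
    """Coarse module sketches sparse keyframes (large motion); fine module
    interpolates bidirectionally between consecutive keyframes (continuous motion)."""
    keyframes = coarse_model(context, num_frames=num_keyframes)   # (B, K, C, H, W), hypothetical call
    clips = []
    for i in range(num_keyframes - 1):
        # Bidirectional conditioning on both endpoints of each segment.
        clips.append(fine_model(start=keyframes[:, i], end=keyframes[:, i + 1]))
    return torch.cat(clips, dim=1)                                # (B, T_long, C, H, W)
```

In this reading, the distillation term would be added to the coarse module's training loss, while inference only alternates the two modules, which is consistent with the reported reduction in inference time.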
Related papers
- Epona: Autoregressive Diffusion World Model for Autonomous Driving [39.389981627403316]
Existing video diffusion models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences. We propose Epona, an autoregressive world model that enables localized distribution modeling.
arXiv Detail & Related papers (2025-06-30T17:56:35Z)
- Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation [53.877572078307935]
Distilled video generation models are fast and efficient but struggle with motion customization when guided by reference videos. We propose MotionEcho, a training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing.
arXiv Detail & Related papers (2025-06-24T06:20:15Z)
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction [17.710915002557996]
Long-context video modeling is essential for enabling generative models to function as world simulators. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. We propose Frame AutoRegressive (FAR), which models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models.
arXiv Detail & Related papers (2025-03-25T03:38:06Z)
- MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving [26.00279480104371]
We propose MiLA, a framework for generating high-fidelity videos of up to one minute in duration. MiLA uses a Coarse-to-Re(fine) approach to stabilize video generation and correct distortion of dynamic objects. Experiments on the nuScenes dataset show that MiLA achieves state-of-the-art video generation quality.
arXiv Detail & Related papers (2025-03-20T05:58:32Z)
- VaViM and VaVAM: Autonomous Driving through Video Generative Modeling [88.33638585518226]
We introduce an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving.
arXiv Detail & Related papers (2025-02-21T18:56:02Z)
- DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT [33.943125216555316]
We present DrivingWorld, a GPT-style world model for autonomous driving. We propose a next-state prediction strategy to model temporal coherence between consecutive frames. We also propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues.
arXiv Detail & Related papers (2024-12-27T07:44:07Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [52.32078428442281]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip (a minimal sketch of this roll-out pattern appears after this list).
We introduce causal (i.e., unidirectional) generation into VDMs, and use past frames as a prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
arXiv Detail & Related papers (2024-06-16T15:37:22Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play method for generative video models.
We transform a video model into a self-cascaded video diffusion model with designed hidden-state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Temporally Consistent Transformers for Video Generation [80.45230642225913]
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world.
No established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies.
We introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time.
arXiv Detail & Related papers (2022-10-05T17:15:10Z)
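Several entries above (ViD-GPT, DrivingWorld, and the training-inference gap discussed in the LongDWM abstract) share the same clip-wise autoregressive loop, in which each new clip is conditioned on the last frames of the video generated so far. Below is a minimal, hypothetical sketch of that loop; the `generate_clip` interface and the `context_frames` default are assumptions standing in for any single-clip video generator.

```python
# Generic clip-wise autoregressive roll-out, as described for ViD-GPT-style
# video diffusion models. `generate_clip` is a hypothetical interface.
from typing import Callable

import torch


def autoregressive_rollout(
    generate_clip: Callable[[torch.Tensor], torch.Tensor],
    initial_frames: torch.Tensor,   # (B, T0, C, H, W)
    num_clips: int,
    context_frames: int = 4,
) -> torch.Tensor:
    """Roll a short-clip generator out to a long video, clip by clip.

    Errors made in one clip become the conditioning for the next, which is
    the accumulation problem long-term driving world models try to reduce.
    """
    video = initial_frames
    for _ in range(num_clips):
        context = video[:, -context_frames:]           # last frames as the prompt
        next_clip = generate_clip(context)             # (B, T_clip, C, H, W)
        video = torch.cat([video, next_clip], dim=1)   # append along the time axis
    return video
```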