Related papers: DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

URL: http://arxiv.org/abs/2412.19505v2
Date: Mon, 30 Dec 2024 09:08:12 GMT
Title: DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
Authors: Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Ping Tan,
Abstract summary: We present DrivingWorld, a GPT-style world model for autonomous driving.<n>We propose a next-state prediction strategy to model temporal coherence between consecutive frames.<n>We also propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues.
Score: 33.943125216555316
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose a novel masking strategy and reweighting strategy for token prediction to mitigate long-term drifting issues and enable precise control. Our work demonstrates the ability to produce high-fidelity and consistent video clips of over 40 seconds in duration, which is over 2 times longer than state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.

Related papers

Epona: Autoregressive Diffusion World Model for Autonomous Driving [39.389981627403316]
Existing video diffusion models struggle with flexible-length, long-horizon predictions and integrating trajectory planning.<n>This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences.<n>We propose Epona, an autoregressive world model that enables localized distribution modeling.
arXiv Detail & Related papers (2025-06-30T17:56:35Z)
LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model [22.92353994818742]
Driving world models are used to simulate futures by video generation based on the condition of the current state and actions.<n>Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility.<n>We propose several solutions to build a simple yet effective long-term driving world model.
arXiv Detail & Related papers (2025-06-02T11:19:23Z)
Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis [12.160537328404622]
textttDRA-Ctrl provides new insights into reusing resource-intensive video models.<n>textttDRA-Ctrl lays foundation for future unified generative models across visual modalities.
arXiv Detail & Related papers (2025-05-29T10:34:45Z)
ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos [13.630119246378518]
We argue that driving world models should have two additional abilities: action control and action prediction.<n>We propose ProphetDWM, a novel end-to-end driving world model that jointly predicts future videos and actions.
arXiv Detail & Related papers (2025-05-24T11:35:09Z)
Learning 3D Persistent Embodied World Models [84.40585374179037]
We introduce a new persistent embodied world model with an explicit memory of previously generated content.<n>During generation time, our video diffusion model predicts RGB-D video of the future observations of the agent.<n>This generation is then aggregated into a persistent 3D map of the environment.
arXiv Detail & Related papers (2025-05-05T17:59:17Z)
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers [61.92571851411509]
We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning.<n>Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:37Z)
ARCON: Advancing Auto-Regressive Continuation for Driving Videos [7.958859992610155]
This paper explores the use of Large Vision Models (LVMs) for video continuation. We introduce ARCON, a scheme that alternates between generating semantic and RGB tokens, allowing the LVM to explicitly learn high-level structural video information. Experiments in autonomous driving scenarios show that our model can consistently generate long videos.
arXiv Detail & Related papers (2024-12-04T22:53:56Z)
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
We introduce MagicDriveDiT, a novel approach based on the DiT architecture. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames.
arXiv Detail & Related papers (2024-11-21T03:13:30Z)
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving. This paper investigates the relationship between these two technologies. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z)
DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past. We leverage the large-scale pretraining of image diffusion models which can handle multi-modality. We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
GenAD: Generalized Predictive Model for Autonomous Driving [75.39517472462089]
We introduce the first large-scale video prediction model in the autonomous driving discipline. Our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. It can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
arXiv Detail & Related papers (2024-03-14T17:58:33Z)
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models. Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.