DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
- URL: http://arxiv.org/abs/2403.06845v2
- Date: Thu, 11 Apr 2024 04:17:13 GMT
- Title: DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
- Authors: Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang
- Abstract summary: We propose DriveDreamer-2, which builds upon the framework of DriveDreamer to generate user-defined driving videos.
Finally, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos.
- Score: 32.30436679335912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges remain in generating customized driving videos. In this paper, we propose DriveDreamer-2, which builds upon the framework of DriveDreamer and incorporates a Large Language Model (LLM) to generate user-defined driving videos. Specifically, an LLM interface is first incorporated to convert a user's query into agent trajectories. Subsequently, an HDMap, adhering to traffic regulations, is generated based on the trajectories. Finally, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos. DriveDreamer-2 is the first world model to generate customized driving videos; it can generate uncommon scenarios (e.g., a vehicle abruptly cutting in) in a user-friendly manner. In addition, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, the video generation quality of DriveDreamer-2 surpasses that of other state-of-the-art methods, with FID and FVD scores of 11.2 and 55.7, relative improvements of approximately 30% and 50%, respectively.
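As a concrete reading of the abstract, the method is a three-stage pipeline: user query → agent trajectories → HDMap → multi-view video. Below is a minimal Python sketch of that control flow; every class, function name, and signature is a hypothetical placeholder for illustration, not an interface published with the paper.

```python
# Minimal sketch of the three-stage DriveDreamer-2 pipeline from the abstract.
# All names below are hypothetical placeholders, not the authors' actual API.
from dataclasses import dataclass


@dataclass
class Trajectory:
    agent_id: int
    waypoints: list[tuple[float, float]]  # (x, y) positions over time


def query_to_trajectories(user_query: str) -> list[Trajectory]:
    """Stage 1: an LLM interface converts a free-form request, e.g.
    'a vehicle abruptly cuts in', into per-agent trajectories."""
    raise NotImplementedError("placeholder for the LLM interface")


def trajectories_to_hdmap(trajectories: list[Trajectory]) -> dict:
    """Stage 2: generate an HDMap (lane and road layout) consistent with
    the trajectories and adhering to traffic regulations."""
    raise NotImplementedError("placeholder for HDMap generation")


def generate_multiview_video(hdmap: dict, trajectories: list[Trajectory],
                             num_views: int = 6) -> list:
    """Stage 3: the Unified Multi-View Model renders temporally and
    spatially coherent driving videos, one per camera view."""
    raise NotImplementedError("placeholder for the video generator")


def drivedreamer2(user_query: str) -> list:
    trajectories = query_to_trajectories(user_query)   # text -> agent motion
    hdmap = trajectories_to_hdmap(trajectories)        # motion -> road layout
    return generate_multiview_video(hdmap, trajectories)  # layout -> videos
```

As a side note on the reported scores, the stated relative improvements imply prior best results of roughly 16.0 FID (0.7 × 16.0 = 11.2) and 111.4 FVD (0.5 × 111.4 ≈ 55.7).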
Related papers
- DreamDrive: Generative 4D Scene Modeling from Street View Images [55.45852373799639]
We present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction.
Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references.
We then render 3D-consistent driving videos via Gaussian splatting.
arXiv Detail & Related papers (2024-12-31T18:59:57Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos.
CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions.
CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
We introduce MagicDriveDiT, a novel approach based on the DiT architecture.
By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents.
Experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames.
arXiv Detail & Related papers (2024-11-21T03:13:30Z) - DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation [32.19534057884047]
We introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors.
To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios.
arXiv Detail & Related papers (2024-10-17T14:07:46Z) - DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model [65.43473733967038]
We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics.
Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge.
arXiv Detail & Related papers (2024-10-14T17:19:23Z) - DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes [15.506076058742744]
We propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation.
To enhance lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding.
We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos.
arXiv Detail & Related papers (2024-09-06T03:09:58Z) - DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving [12.004604110512421]
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving.
We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
arXiv Detail & Related papers (2024-08-29T15:52:56Z)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
- DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving [76.24483706445298]
We introduce DriveDreamer, a world model entirely derived from real-world driving scenarios.
In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states (a rough sketch of this two-stage setup appears after this entry).
DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.
arXiv Detail & Related papers (2023-09-18T13:58:42Z)
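The two-stage recipe in the DriveDreamer entry above can be pictured as plain training loops. This is a hedged sketch under assumed names (the `generation_loss` and `prediction_loss` methods and the batch keys are illustrative inventions); the abstract describes only the staging, not this interface.

```python
# Rough sketch of the two-stage training described in the DriveDreamer
# abstract. All method names and batch keys are assumptions for illustration.

def train_drivedreamer(model, optimizer, stage1_loader, stage2_loader):
    # Stage 1: learn structured traffic constraints by conditioning video
    # generation on road structure (HDMap) and agent boxes from real scenes.
    for batch in stage1_loader:
        loss = model.generation_loss(
            frames=batch["frames"],
            conditions={"hdmap": batch["hdmap"], "boxes": batch["boxes"]},
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: learn to anticipate future states by predicting upcoming
    # frames from past observations and driving actions.
    for batch in stage2_loader:
        loss = model.prediction_loss(
            past_frames=batch["past_frames"],
            actions=batch["actions"],
            future_frames=batch["future_frames"],
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```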