DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation
- URL: http://arxiv.org/abs/2409.05463v4
- Date: Thu, 12 Sep 2024 12:32:21 GMT
- Title: DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation
- Authors: Wei Wu, Xi Guo, Weixuan Tang, Tingxuan Huang, Chiyu Wang, Dongyue Chen, Chenjing Ding,
- Abstract summary: DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
- Score: 10.296670127024045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in generative models have provided promising solutions for synthesizing realistic driving videos, which are crucial for training autonomous driving perception models. However, existing approaches often struggle with multi-view video generation due to the challenges of integrating 3D information while maintaining spatial-temporal consistency and effectively learning from a unified model. We propose DriveScape, an end-to-end framework for multi-view, 3D condition-guided video generation, capable of producing 1024 x 576 high-resolution videos at 10Hz. Unlike other methods limited to 2Hz due to the 3D box annotation frame rate, DriveScape overcomes this with its ability to operate under sparse conditions. Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information, maintaining spatial-temporal consistency. DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39. Our project homepage: https://metadrivescape.github.io/papers_project/drivescapev1/index.html
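The abstract names the Bi-Directional Modulated Transformer (BiMot) but does not spell out its internals. As a rough illustration only, the sketch below assumes a FiLM-style scale-and-shift modulation applied in both directions between encoded 3D-condition tokens (e.g. 3D boxes and road layout) and video latent tokens; every module and variable name here is hypothetical and not taken from the paper.

```python
# Illustrative sketch only: the BiMot block's real structure is not given in
# the abstract, so this assumes FiLM-style modulation in both directions
# between 3D-condition tokens and video tokens.
import torch
import torch.nn as nn


class BiDirectionalModulation(nn.Module):
    """Hypothetical block: video tokens and 3D-condition tokens modulate
    each other with scale/shift (FiLM-like) parameters."""

    def __init__(self, dim: int):
        super().__init__()
        # Each stream predicts a (scale, shift) pair for the other stream.
        self.cond_to_video = nn.Linear(dim, 2 * dim)
        self.video_to_cond = nn.Linear(dim, 2 * dim)
        self.norm_video = nn.LayerNorm(dim)
        self.norm_cond = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, cond_tokens: torch.Tensor):
        # video_tokens: (B, N_v, dim), cond_tokens: (B, N_c, dim)
        # Pool each stream to a single summary vector used for modulation.
        cond_summary = cond_tokens.mean(dim=1, keepdim=True)
        video_summary = video_tokens.mean(dim=1, keepdim=True)

        scale_v, shift_v = self.cond_to_video(cond_summary).chunk(2, dim=-1)
        scale_c, shift_c = self.video_to_cond(video_summary).chunk(2, dim=-1)

        video_out = self.norm_video(video_tokens) * (1 + scale_v) + shift_v
        cond_out = self.norm_cond(cond_tokens) * (1 + scale_c) + shift_c
        return video_out, cond_out


if __name__ == "__main__":
    block = BiDirectionalModulation(dim=64)
    video = torch.randn(2, 128, 64)  # e.g. flattened multi-view latent tokens
    cond = torch.randn(2, 32, 64)    # e.g. encoded 3D boxes / road layout
    v, c = block(video, cond)
    print(v.shape, c.shape)
```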
Related papers
- DreamDrive: Generative 4D Scene Modeling from Street View Images [55.45852373799639]
We present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction.
Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references.
We then render 3D-consistent driving videos via Gaussian splatting.
arXiv: 2024-12-31T18:59:57Z
- DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT [33.943125216555316]
We present DrivingWorld, a GPT-style world model for autonomous driving.
We propose a next-state prediction strategy to model temporal coherence between consecutive frames.
We also propose novel masking and reweighting strategies for token prediction to mitigate long-term drifting issues.
arXiv: 2024-12-27T07:44:07Z
- Physical Informed Driving World Model [47.04423342994622]
DrivePhysica is an innovative model designed to generate realistic driving videos that adhere to essential physical principles.
We achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks (see the FID/FVD sketch after this list).
arXiv: 2024-12-11T14:29:35Z
- Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model [83.31688383891871]
We propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes.
Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles.
It decouples spatial-temporal relationships and produces coherent driving videos.
arXiv: 2024-12-06T18:59:56Z
- UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving [18.189392365510848]
UniMLVG is a unified framework designed to generate extended street multi-perspective videos under precise control.
By integrating single- and multi-view driving videos into the training data, our approach updates cross-frame and cross-view modules across three stages.
Our framework achieves improvements of 21.4% in FID and 36.5% in FVD.
arXiv: 2024-12-06T08:27:53Z
- InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models [75.03495065452955]
We present InfiniCube, a scalable method for generating dynamic 3D driving scenes with high fidelity and controllability.
Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.
arXiv: 2024-12-05T07:32:20Z
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos.
CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions.
CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv: 2024-12-04T18:02:49Z
- MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control [68.74166535159311]
We introduce MagicDriveDiT, a novel approach based on the DiT architecture.
By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents.
Experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames.
arXiv: 2024-11-21T03:13:30Z
- DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes [15.506076058742744]
We propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation.
To enhance the lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding.
We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos.
arXiv: 2024-09-06T03:09:58Z
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes [72.02827211293736]
We introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation.
Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data.
Our results demonstrate the framework's superior performance, showcasing its potential for autonomous driving simulation and beyond.
arXiv: 2024-05-23T12:04:51Z
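Several of the works above report Fréchet Inception Distance and Fréchet Video Distance on nuScenes (DriveScape at 8.34 FID / 76.39 FVD, DrivePhysica at 3.96 FID / 38.06 FVD, CogDriving at 37.8 FVD). As a reference for what those numbers measure, here is a minimal NumPy/SciPy sketch of the Fréchet distance between two feature sets; it assumes Inception features (for FID) or video-network features such as I3D (for FVD) have already been extracted, and the array names are illustrative.

```python
# Minimal Fréchet distance sketch; feature extraction is assumed to have
# happened elsewhere (Inception for FID, a video network such as I3D for FVD).
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 * (C_r C_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary
    # parts caused by numerical error.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 64))          # stand-in for real-frame features
    fake = rng.normal(loc=0.1, size=(500, 64)) # stand-in for generated-frame features
    print(f"toy Frechet distance: {frechet_distance(real, fake):.3f}")
```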
This list is automatically generated from the titles and abstracts of the papers in this site.