DrivingDiffusion: Layout-Guided multi-view driving scene video
generation with latent diffusion model
- URL: http://arxiv.org/abs/2310.07771v1
- Date: Wed, 11 Oct 2023 18:00:08 GMT
- Title: DrivingDiffusion: Layout-Guided multi-view driving scene video
generation with latent diffusion model
- Authors: Xiaofan Li, Yifu Zhang and Xiaoqing Ye
- Abstract summary: We propose DrivingDiffusion to generate realistic multi-view videos controlled by 3D layout.
Our model can generate large-scale realistic multi-camera driving videos in complex urban scenes.
- Score: 19.288610627281102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing popularity of autonomous driving built on the powerful
and unified bird's-eye-view (BEV) representation, there is an urgent demand for
high-quality, large-scale multi-view video data with accurate annotations.
However, such large-scale multi-view data is hard to obtain due to expensive
collection and annotation costs. To alleviate this problem, we propose
DrivingDiffusion, a spatial-temporal consistent diffusion framework that
generates realistic multi-view videos controlled by 3D layout. Synthesizing
multi-view videos from a 3D layout poses three challenges: 1) how to keep
cross-view consistency, 2) how to keep cross-frame consistency, and 3) how to
guarantee the quality of the generated instances. DrivingDiffusion addresses
these challenges by cascading a multi-view single-frame image generation step,
a single-view video generation step shared by multiple cameras, and a
post-processing stage that handles long video generation. In the multi-view
model, consistency across views is ensured by information exchange between
adjacent cameras. In the temporal model, the information needed to generate
subsequent frames is queried from the multi-view images of the first frame. We
also introduce a local prompt that effectively improves the quality of generated
instances. In post-processing, we further enhance the cross-view consistency of
subsequent frames and extend the video length with a temporal sliding-window
algorithm. Without any extra cost, our model can generate large-scale,
realistic multi-camera driving videos of complex urban scenes, fueling
downstream driving tasks. The code will be made publicly available.
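The cascade described above (layout-conditioned multi-view first-frame generation, per-camera temporal generation keyed to that first frame, and a sliding-window extension for long videos) can be illustrated with the control-flow sketch below. This is only an illustration under assumed settings: the camera count, resolution, window length and stride, and every class and function name are hypothetical placeholders rather than the authors' released code, and the two generators are stubbed out so that only the shapes and the cascade order are meaningful.

```python
# Hypothetical sketch of a DrivingDiffusion-style cascade; names and settings
# are assumptions, the diffusion models themselves are replaced by stubs.
from dataclasses import dataclass
from typing import List
import numpy as np

N_CAMS, H, W = 6, 128, 128      # assumed camera count and resolution
CLIP_LEN, STRIDE = 8, 4         # assumed sliding-window length and stride


@dataclass
class Layout:
    """Projected 3D layout condition (e.g. boxes, road map) for one camera/frame."""
    cam_idx: int
    frame_idx: int


def multi_view_single_frame(layouts: List[Layout]) -> np.ndarray:
    """Stage 1 (stub): jointly generate the first frame of all cameras.
    In the paper, cross-view consistency comes from information exchange
    between adjacent cameras; here we only return arrays of the right shape."""
    return np.random.rand(N_CAMS, H, W, 3)


def single_view_video(seed_frame: np.ndarray, key_frames: np.ndarray,
                      length: int) -> np.ndarray:
    """Stage 2 (stub): extend one camera through time, querying the
    first-frame multi-view images (key_frames) for content that must stay
    consistent across frames."""
    return np.repeat(seed_frame[None], length, axis=0)


def generate_long_video(layouts: List[Layout], total_len: int) -> np.ndarray:
    """Stage 3: temporal sliding window. Each window is re-seeded with the
    last frame already generated so the clip can be extended indefinitely."""
    video = np.zeros((N_CAMS, total_len, H, W, 3))
    video[:, 0] = multi_view_single_frame(
        [l for l in layouts if l.frame_idx == 0])
    key_frames = video[:, 0]
    t = 0
    while t + 1 < total_len:
        end = min(t + CLIP_LEN, total_len)
        for cam in range(N_CAMS):
            video[cam, t:end] = single_view_video(
                video[cam, t], key_frames, end - t)
        t += STRIDE
    return video


if __name__ == "__main__":
    dummy_layouts = [Layout(c, f) for c in range(N_CAMS) for f in range(16)]
    vid = generate_long_video(dummy_layouts, total_len=16)
    print(vid.shape)  # (6, 16, 128, 128, 3)
```

The design point carried by the last stage is that each new window is seeded with frames that were already generated, so cross-frame consistency carries over while the clip is extended well beyond the length the temporal model handles at once, at the cost of re-generating the overlapping frames.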
Related papers
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014] (2024-10-01)
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286] (2024-06-12)
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such a multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712] (2024-05-31)
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
- Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention [87.02613021058484] (2024-05-19)
We introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image.
Era3D generates high-quality multiview images at up to 512x512 resolution while reducing complexity by 12x.
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767] (2024-03-18)
Two fundamental questions are what data to use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
- Envision3D: One Image to 3D with Anchor Views Interpolation [18.31796952040799] (2024-03-13)
We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image.
It is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.
- LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation [51.19871052619077] (2024-02-07)
We introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images.
We maintain the fast speed of generating 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
- SyncDreamer: Generating Multiview-consistent Images from a Single-view Image [59.75474518708409] (2023-09-07)
A novel diffusion model called SyncDreamer generates multiview-consistent images from a single-view image.
Experiments show that SyncDreamer generates images with high consistency across different views.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information provided and is not responsible for any consequences arising from its use.