HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving
- URL: http://arxiv.org/abs/2412.01407v2
- Date: Tue, 03 Dec 2024 13:14:39 GMT
- Title: HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving
- Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
- Abstract summary: We propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds.
We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models.
Our method leads to significant performance gains over SOTA methods in terms of generation metrics.
- Score: 29.327572707959916
- Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple kinds of input modality, usually cameras and LiDARs, where they contain complementary information for generation, while existing generation methods ignore this crucial feature, resulting in the generated results only covering separate 2D or 3D information. In order to fill the gap in 2D-3D multi-modal joint generation for autonomous driving, in this paper, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projecting from image space to BEV space, then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate our method leads to significant performance gains over SOTA methods in terms of generation metrics.
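The abstract notes that a depth prediction disambiguates the un-projection from image space to BEV space. The sketch below is only a generic pinhole-camera illustration of that ambiguity, not the paper's Camera-to-BEV module: the function name `unproject_to_bev`, the intrinsics `K`, and the grid parameters `bev_range` and `bev_size` are all hypothetical. It shows that the same pixel lands in different BEV cells depending on the depth assigned to it.

```python
import numpy as np

def unproject_to_bev(u, v, depth, K, bev_range=50.0, bev_size=200):
    """Lift pixel (u, v) with a depth estimate into a BEV grid cell.

    Assumed camera frame: x right, y down, z forward (metres).
    The BEV grid is assumed to span [-bev_range, bev_range) on both axes.
    """
    fx, cx = K[0, 0], K[0, 2]
    # Pinhole back-projection: lateral offset scales linearly with depth.
    # The pixel row v only affects height, which BEV discards.
    x = (u - cx) / fx * depth
    z = depth
    # Convert metric coordinates to integer grid indices.
    cell = 2.0 * bev_range / bev_size
    col = int(np.floor((x + bev_range) / cell))
    row = int(np.floor((z + bev_range) / cell))
    return row, col

# Hypothetical intrinsics for a 1600x900 image.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])

near = unproject_to_bev(1030, 450, 10.2, K)
far = unproject_to_bev(1030, 450, 40.2, K)
print(near, far)
```

Without a depth value, every cell along the pixel's viewing ray is an equally valid target, which is why a per-pixel depth estimate is needed before image features can be placed in the BEV grid.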
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
- Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation [51.36926306499593]
Prometheus is a 3D-aware latent diffusion model for text-to-3D generation at both object and scene levels in seconds.
We formulate 3D scene generation as multi-view, feed-forward, pixel-aligned 3D Gaussian generation within the latent diffusion paradigm.
arXiv Detail & Related papers (2024-12-30T17:44:23Z)
- VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving [25.03216574230919]
We propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create 3D vehicle assets for autonomous driving.
VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction.
We conduct experiments on various datasets, including Pascal 3D+, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-07-09T03:09:55Z)
- MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes [72.02827211293736]
We introduce MagicDrive3D, a novel pipeline for controllable 3D street scene generation.
Unlike previous methods that reconstruct before training the generative models, MagicDrive3D first trains a video generation model and then reconstructs from the generated data.
Our results demonstrate the framework's superior performance, showcasing its potential for autonomous driving simulation and beyond.
arXiv Detail & Related papers (2024-05-23T12:04:51Z)
- X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation [61.48050470095969]
X-Dreamer is a novel approach for high-quality text-to-3D content creation.
It bridges the gap between text-to-2D and text-to-3D synthesis.
arXiv Detail & Related papers (2023-11-30T07:23:00Z)
- Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models [97.58685709663287]
Generative pre-training can boost the performance of fundamental models in 2D vision.
In 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training.
We propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model.
arXiv Detail & Related papers (2023-07-27T16:07:03Z)
- GINA-3D: Learning to Generate Implicit Neural Assets in the Wild [38.51391650845503]
GINA-3D is a generative model that uses real-world driving data from camera and LiDAR sensors to create 3D implicit neural assets of diverse vehicles and pedestrians.
We construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians.
We demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
arXiv Detail & Related papers (2023-04-04T23:41:20Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)