Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models
- URL: http://arxiv.org/abs/2406.13099v1
- Date: Tue, 18 Jun 2024 23:14:29 GMT
- Title: Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models
- Authors: Paul Henderson, Melonie de Almeida, Daniela Ivanova, Titas Anciukevičius
- Abstract summary: We present a latent diffusion model over 3D scenes that can be trained using only 2D image data.
We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, or from sparse input views.
- Score: 3.9373541926236766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline requires neither object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes -- MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.
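The abstract describes a two-stage design: an autoencoder that compresses multi-view images into a latent and decodes it to 3D Gaussian splat parameters, with a diffusion model trained over that latent space. The following is a minimal, hedged sketch of that data flow only; all shapes, the Gaussian parameterization, and the stand-in "encoder", "denoiser", and "decoder" (random projections and a toy iterative update) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper)
N_VIEWS, H, W = 4, 32, 32   # multi-view input images
LATENT_DIM = 64             # compressed latent per view
N_GAUSSIANS = 128           # splats decoded per scene
GAUSSIAN_PARAMS = 14        # e.g. mean(3)+scale(3)+rotation(4)+color(3)+opacity(1)

rng = np.random.default_rng(0)

def encode(views):
    """Stage 1a: map multi-view images to a compressed latent
    (stand-in: a fixed random projection)."""
    flat = views.reshape(views.shape[0], -1)              # (N_VIEWS, H*W*3)
    proj = rng.standard_normal((flat.shape[1], LATENT_DIM))
    return flat @ proj                                    # (N_VIEWS, LATENT_DIM)

def denoise(latent, steps=4):
    """Stage 2: multi-view diffusion over the latent space
    (stand-in: iterative shrinkage from noise toward a target)."""
    x = rng.standard_normal(latent.shape)                 # start from pure noise
    for _ in range(steps):
        x = x + 0.5 * (latent - x)                        # toy "denoising" update
    return x

def decode_to_gaussians(latent):
    """Stage 1b: decode the latent into 3D Gaussian splat parameters."""
    proj = rng.standard_normal((latent.shape[1], N_GAUSSIANS * GAUSSIAN_PARAMS))
    per_view = latent @ proj                              # (N_VIEWS, N_GAUSSIANS*14)
    return per_view.mean(axis=0).reshape(N_GAUSSIANS, GAUSSIAN_PARAMS)

views = rng.standard_normal((N_VIEWS, H, W, 3))
z = encode(views)                  # multi-view images -> latent
z_denoised = denoise(z)            # sample in latent space
splats = decode_to_gaussians(z_denoised)  # latent -> Gaussian splats
print(splats.shape)                # (128, 14)
```

Because the diffusion runs in a small latent space and decoding to splats is a single feed-forward pass, the 0.2-second sampling time claimed in the abstract is plausible; the heavy per-scene optimization of earlier NeRF-based generators is avoided entirely.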
Related papers
- SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images [49.7344030427291]
We study the problem of single-image 3D object reconstruction.
Recent works have diverged into two directions: regression-based modeling and generative modeling.
We present SPAR3D, a novel two-stage approach aiming to take the best of both directions.
arXiv Detail & Related papers (2025-01-08T18:52:03Z)
- Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes.
We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
- DSplats: 3D Generation by Denoising Splats-Based Multiview Diffusion Models [67.50989119438508]
We introduce DSplats, a novel method that directly denoises multiview images using Gaussian-based Reconstructors to produce realistic 3D assets.
Our experiments demonstrate that DSplats not only produces high-quality, spatially consistent outputs, but also sets a new standard in single-image to 3D reconstruction.
arXiv Detail & Related papers (2024-12-11T07:32:17Z)
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation [45.95218923564575]
We propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view.
Experiments show that our method enjoys better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and over 5x faster speed (6s on an A100 GPU) than SOTA methods.
arXiv Detail & Related papers (2024-11-21T18:21:24Z)
- AugGS: Self-augmented Gaussians with Structural Masks for Sparse-view 3D Reconstruction [9.953394373473621]
Sparse-view 3D reconstruction is a major challenge in computer vision.
We propose a self-augmented two-stage Gaussian splatting framework enhanced with structural masks for sparse-view 3D reconstruction.
Our approach achieves state-of-the-art performance in perceptual quality and multi-view consistency with sparse inputs.
arXiv Detail & Related papers (2024-08-09T03:09:22Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation [51.19871052619077]
We introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images.
We maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
arXiv Detail & Related papers (2024-02-07T17:57:03Z)
- Pyramid Diffusion for Fine 3D Large Scene Generation [56.00726092690535]
Diffusion models have shown remarkable results in generating 2D images and small-scale 3D objects.
Their application to the synthesis of large-scale 3D scenes has rarely been explored.
We introduce a framework, the Pyramid Discrete Diffusion model (PDD), which employs scale-varied diffusion models to progressively generate high-quality outdoor scenes.
arXiv Detail & Related papers (2023-11-20T11:24:21Z)
- HoloDiffusion: Training a 3D Diffusion Model using 2D Images [71.1144397510333]
We introduce a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision.
We show that our diffusion models are scalable, train robustly, and are competitive in terms of sample quality and fidelity to existing approaches for 3D generative modeling.
arXiv Detail & Related papers (2023-03-29T07:35:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.