Enhancing Monocular 3D Scene Completion with Diffusion Model
- URL: http://arxiv.org/abs/2503.00726v1
- Date: Sun, 02 Mar 2025 04:36:57 GMT
- Title: Enhancing Monocular 3D Scene Completion with Diffusion Model
- Authors: Changlin Song, Jiaqi Wang, Liyun Zhu, He Weng,
- Abstract summary: 3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving.<n>Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance.<n>We introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image.
- Score: 20.81599069390756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving, enabling machines to understand and interact with complex environments. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance, but this dependence limits their use in scenarios where only a single image is available. In this work, we introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image, significantly reducing the need for multi-view inputs. Our approach leverages a pre-trained vision-language model to generate descriptive prompts for the scene, guiding a diffusion model to produce images from various perspectives, which are then fused to form a cohesive 3D reconstruction. Extensive experiments show that our method effectively and robustly expands single-image inputs into a comprehensive 3D scene, extending monocular 3D reconstruction capabilities without further training. Our code is available https://github.com/CharlieSong1999/FlashDreamer/tree/main.
Related papers
- IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos [33.12653115668027]
Our method generates Multiplane Images (MPIs) that ensure geometric consistency.
Our approach directly generates the final output through a single denoising process.
To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space.
arXiv Detail & Related papers (2025-04-27T08:56:02Z) - Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes.<n>We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z) - 3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance [61.06034736050515]
We introduce a method capable of generating camera-controlled viewpoints from a single input image.<n>Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data.
arXiv Detail & Related papers (2024-08-12T13:53:40Z) - DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z) - CAT3D: Create Anything in 3D with Multi-View Diffusion Models [87.80820708758317]
We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model.
CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation.
arXiv Detail & Related papers (2024-05-16T17:59:05Z) - Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Vision Transformer for NeRF-Based View Synthesis from a Single Input
Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.