Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
- URL: http://arxiv.org/abs/2508.02323v1
- Date: Mon, 04 Aug 2025 11:43:12 GMT
- Title: Dream-to-Recon: Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
- Authors: Philipp Wulff, Felix Wimbauer, Dominik Muhle, Daniel Cremers
- Abstract summary: We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines.
- Score: 39.08243715525956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.
Related papers
- DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos [52.46386528202226]
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM). It is the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. It achieves performance on par with state-of-the-art monocular video 3D tracking methods.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations [44.53106180688135]
This work takes on the challenge of reconstructing 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency.
arXiv Detail & Related papers (2025-05-17T13:05:13Z)
- IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos [33.12653115668027]
Our method generates Multiplane Images (MPIs) that ensure geometric consistency. Our approach directly generates the final output through a single denoising process. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space.
arXiv Detail & Related papers (2025-04-27T08:56:02Z)
- Enhancing Monocular 3D Scene Completion with Diffusion Model [20.81599069390756]
3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance. We introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image.
arXiv Detail & Related papers (2025-03-02T04:36:57Z)
- Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model [86.37536249046943]
DMV3D is a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion.
Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering.
arXiv Detail & Related papers (2023-11-15T18:58:41Z)
- Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction [77.69363640021503]
3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images.
We present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects.
arXiv Detail & Related papers (2023-04-13T17:59:01Z)
- Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion [54.151979979158085]
We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available.
We leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution.
Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios.
arXiv Detail & Related papers (2022-11-21T17:42:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.