Related papers: 3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

URL: http://arxiv.org/abs/2408.06157v3
Date: Tue, 8 Oct 2024 03:03:37 GMT
Title: 3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance
Authors: Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin,
Abstract summary: We introduce a method capable of generating camera-controlled viewpoints from a single input image. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data.
Score: 61.06034736050515
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes and struggle with complex environments. They often require extensive 3D data for training, lacking generalization beyond the training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without the need for a large amount of 3D-based training data, but lack camera control. In this paper, we introduce a method capable of generating camera-controlled viewpoints from a single input image, by combining the benefits of 3D-free and 3D-based approaches. Our method excels in handling complex and diverse scenes without extensive training or additional 3D and multiview data. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results. Experimental results demonstrate that our method outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

Related papers

Enhancing Monocular 3D Scene Completion with Diffusion Model [20.81599069390756]
3D scene reconstruction is essential for applications in virtual reality, robotics, and autonomous driving. Traditional 3D Gaussian Splatting techniques rely on images captured from multiple viewpoints to achieve optimal performance. We introduce FlashDreamer, a novel approach for reconstructing a complete 3D scene from a single image.
arXiv Detail & Related papers (2025-03-02T04:36:57Z)
Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z)
WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs) Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images [70.17085345196583]
Control3Diff is a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet.
arXiv Detail & Related papers (2023-04-13T17:52:29Z)
3inGAN: Learning a 3D Generative Model from Images of a Self-similar Scene [34.2144933185175]
3inGAN is an unconditional 3D generative model trained from 2D images of a single self-similar 3D scene. We show results on semi-stochastic scenes of varying scale and complexity, obtained from real and synthetic sources.
arXiv Detail & Related papers (2022-11-27T18:03:21Z)
Novel View Synthesis with Diffusion Models [56.55571338854636]
We present 3DiM, a diffusion model for 3D novel view synthesis. It is able to translate a single input view into consistent and sharp completions across many views. 3DiM can generate multiple views that are 3D consistent using a novel technique called conditioning.
arXiv Detail & Related papers (2022-10-06T16:59:56Z)
AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation [27.163052958878776]
This paper targets on learning-based novel view synthesis from a single or limited 2D images without the pose supervision. We construct an end-to-end trainable conditional variational framework to disentangle the unsupervisely learned relative-pose/rotation and implicit global 3D representation. Our system can achieve implicitly 3D understanding without explicitly 3D reconstruction.
arXiv Detail & Related papers (2020-07-13T18:51:27Z)
Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data. We evaluate our proposed approach on two large scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.