Viewpoint Textual Inversion: Unleashing Novel View Synthesis with
Pretrained 2D Diffusion Models
- URL: http://arxiv.org/abs/2309.07986v1
- Date: Thu, 14 Sep 2023 18:52:16 GMT
- Title: Viewpoint Textual Inversion: Unleashing Novel View Synthesis with
Pretrained 2D Diffusion Models
- Authors: James Burgess, Kuan-Chieh Wang, and Serena Yeung
- Abstract summary: We show that 3D knowledge is encoded in 2D image diffusion models like Stable Diffusion.
Our method, Viewpoint Neural Textual Inversion (ViewNeTI), controls the 3D viewpoint of objects in generated images from frozen diffusion models.
- Score: 13.760540874218705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models understand spatial relationships between
objects, but do they represent the true 3D structure of the world from only 2D
supervision? We demonstrate that yes, 3D knowledge is encoded in 2D image
diffusion models like Stable Diffusion, and we show that this structure can be
exploited for 3D vision tasks. Our method, Viewpoint Neural Textual Inversion
(ViewNeTI), controls the 3D viewpoint of objects in generated images from
frozen diffusion models. We train a small neural mapper to take camera
viewpoint parameters and predict text encoder latents; the latents then
condition the diffusion generation process to produce images with the desired
camera viewpoint.
ViewNeTI naturally addresses Novel View Synthesis (NVS). By leveraging the
frozen diffusion model as a prior, we can solve NVS with very few input views;
we can even do single-view novel view synthesis. Our single-view NVS
predictions have good semantic details and photorealism compared to prior
methods. Our approach is well suited for modeling the uncertainty inherent in
sparse 3D vision problems because it can efficiently generate diverse samples.
Our view-control mechanism is general, and can even change the camera view in
images generated by user-defined prompts.
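The mechanism lends itself to a compact sketch. Below is a minimal PyTorch illustration of the idea, assuming Stable Diffusion v1-style 768-dim text embeddings; the mapper architecture, the 12-parameter camera encoding, and the wrapper functions (`encode_prompt`, `add_noise`, `frozen_unet`) are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewpointMapper(nn.Module):
    """Small MLP mapping camera viewpoint parameters to a pseudo-token in the
    text encoder's embedding space. Dimensions are illustrative: a 12-number
    flattened [R | t] camera matrix in, a 768-dim CLIP-style embedding out."""

    def __init__(self, cam_dim: int = 12, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cam_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, cam_params: torch.Tensor) -> torch.Tensor:
        # cam_params: (B, cam_dim) -> (B, embed_dim) viewpoint pseudo-token
        return self.net(cam_params)

def train_step(mapper, frozen_unet, encode_prompt, add_noise,
               image_latents, cam_params):
    """One textual-inversion-style optimization step. Only `mapper` receives
    gradients; the diffusion model stays frozen. `frozen_unet`, `encode_prompt`,
    and `add_noise` are hypothetical wrappers around a Stable Diffusion
    pipeline, not a real library API."""
    view_token = mapper(cam_params)                # text-space latent for this view
    cond = encode_prompt("a photo of <object>", extra_token=view_token)
    t = torch.randint(0, 1000, (image_latents.shape[0],))
    noisy, noise = add_noise(image_latents, t)     # forward diffusion
    pred = frozen_unet(noisy, t, encoder_hidden_states=cond)
    return F.mse_loss(pred, noise)                 # standard denoising loss
```
At inference, feeding new camera parameters through the trained mapper steers the frozen model toward the corresponding viewpoint; because only the small mapper is optimized, the setup can work from very few posed input views.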
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving.
It predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
It is trained in a self-supervised manner, using differentiable rendering to reconstruct RGB, depth, or feature images.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs).
Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry.
This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
- DreamSparse: Escaping from Plato's Cave with 2D Frozen Diffusion Model Given Sparse Views [20.685453627120832]
Existing methods often struggle to produce high-quality results or require per-object optimization in such few-view settings.
DreamSparse is capable of synthesizing high-quality novel views for both object and scene-level images.
arXiv Detail & Related papers (2023-06-06T05:26:26Z)
- Generative Novel View Synthesis with 3D-Aware Diffusion Models [96.78397108732233]
We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image.
Our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume.
In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences.
arXiv Detail & Related papers (2023-04-05T17:15:47Z)
- 3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models [71.25937799010407]
We extend text-guided diffusion models to achieve 3D-consistent generation.
We study 3D local editing and propose a two-step solution.
We extend our model to perform one-shot novel view synthesis.
arXiv Detail & Related papers (2022-11-25T13:50:00Z)
- Novel View Synthesis with Diffusion Models [56.55571338854636]
We present 3DiM, a diffusion model for 3D novel view synthesis.
It is able to translate a single input view into consistent and sharp completions across many views.
3DiM can generate multiple views that are 3D-consistent using a novel technique called stochastic conditioning (sketched after this list).
arXiv Detail & Related papers (2022-10-06T16:59:56Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision [26.885846254261626]
Continuous Object Representation Networks (CORN) is a conditional architecture that encodes an input image's geometry and appearance into a 3D-consistent scene representation.
CORN performs well on challenging tasks such as novel view synthesis and single-view 3D reconstruction, achieving performance comparable to state-of-the-art approaches that use direct supervision.
arXiv Detail & Related papers (2020-07-30T17:49:44Z)
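For the 3DiM entry above, here is a minimal sketch of what stochastic conditioning means in practice: when sampling a new view autoregressively, each reverse-diffusion step conditions on one randomly chosen available view rather than on all views at once. The `denoise_step` callable is a hypothetical wrapper for one reverse-diffusion update of a pose-conditional denoiser; the real 3DiM interface differs.
```python
import random
import torch

@torch.no_grad()
def sample_views(denoise_step, views, poses, target_poses, num_steps=256):
    """Sketch of 3DiM-style stochastic conditioning for autoregressive NVS.
    `views`/`poses` are the given input view(s); `target_poses` are the
    cameras to synthesize."""
    views, poses = list(views), list(poses)
    outputs = []
    for target_pose in target_poses:
        x = torch.randn_like(views[0])           # start the new frame from noise
        for t in reversed(range(num_steps)):
            # Condition each step on ONE randomly chosen available view, so
            # over many steps the sample is influenced by all of them.
            k = random.randrange(len(views))
            x = denoise_step(x, t, views[k], poses[k], target_pose)
        views.append(x)                          # feed the generated view back in
        poses.append(target_pose)
        outputs.append(x)
    return outputs
```
Because generated views are appended to the conditioning pool, later frames stay consistent with earlier ones, which is what makes the autoregressive sequence approximately 3D-consistent.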