Geometry-Free View Synthesis: Transformers and no 3D Priors
- URL: http://arxiv.org/abs/2104.07652v1
- Date: Thu, 15 Apr 2021 17:58:05 GMT
- Title: Geometry-Free View Synthesis: Transformers and no 3D Priors
- Authors: Robin Rombach and Patrick Esser and Björn Ommer
- Abstract summary: We show that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases.
This is achieved by (i) a global attention mechanism for implicitly learning long-range 3D correspondences between source and target views, and (ii) a probabilistic formulation that captures the ambiguity inherent in predicting novel views from a single image.
- Score: 16.86600007830682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Is a geometric model required to synthesize novel views from a single image?
Being bound to local convolutions, CNNs need explicit 3D biases to model
geometric transformations. In contrast, we demonstrate that a transformer-based
model can synthesize entirely novel views without any hand-engineered 3D
biases. This is achieved by (i) a global attention mechanism for implicitly
learning long-range 3D correspondences between source and target views, and
(ii) a probabilistic formulation necessary to capture the ambiguity inherent in
predicting novel views from a single image, thereby overcoming the limitations
of previous approaches that are restricted to relatively small viewpoint
changes. We evaluate various ways to integrate 3D priors into a transformer
architecture. However, our experiments show that no such geometric priors are
required and that the transformer is capable of implicitly learning 3D
relationships between images. Furthermore, this approach outperforms the state
of the art in terms of visual quality while covering the full distribution of
possible realizations. Code is available at https://git.io/JOnwn
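To make ingredients (i) and (ii) concrete, the following is a minimal, hedged PyTorch sketch of a decoder-only transformer for geometry-free view synthesis: discrete source-view tokens and a camera-transformation embedding are prefixed to the target sequence, self-attention is global across both views, and novel views are drawn by autoregressive sampling, which realizes the probabilistic formulation. The class name, the 12-dimensional camera parameterization, the token vocabulary, and all dimensions are illustrative assumptions rather than the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch (not the authors' code): a decoder-only transformer that
# conditions on source-view tokens plus a camera-transformation embedding and
# models target-view tokens autoregressively, so sampling yields diverse views.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryFreeViewSynthesizer(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8,
                 src_len=256, tgt_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # learned positions for [camera token] + source tokens + target tokens
        self.pos_emb = nn.Parameter(torch.zeros(1, 1 + src_len + tgt_len, d_model))
        # relative camera transformation, assumed flattened to 12 numbers (R | t)
        self.cam_emb = nn.Linear(12, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)
        self.tgt_len = tgt_len

    def forward(self, src_tokens, camera, tgt_tokens):
        # src_tokens: (B, S) discrete codes of the source view (e.g. from a VQ encoder)
        # camera:     (B, 12) relative camera transformation between the two views
        # tgt_tokens: (B, T) target codes generated so far (teacher forcing or sampling)
        cam = self.cam_emb(camera).unsqueeze(1)                         # (B, 1, D)
        x = torch.cat([cam, self.tok_emb(src_tokens),
                       self.tok_emb(tgt_tokens)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        # global (full-sequence) attention under a standard causal mask, so every
        # target position can attend to the entire source view and the camera token
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.blocks(x, mask=mask)
        n_cond = 1 + src_tokens.size(1)
        # one next-token prediction per position, starting at the last conditioning token
        return self.head(x[:, n_cond - 1:])

    @torch.no_grad()
    def sample(self, src_tokens, camera, temperature=1.0):
        # Autoregressive sampling: each call draws one plausible novel view,
        # which is the probabilistic formulation described in the abstract.
        tgt = torch.zeros(src_tokens.size(0), 0, dtype=torch.long,
                          device=src_tokens.device)
        for _ in range(self.tgt_len):
            logits = self(src_tokens, camera, tgt)[:, -1] / temperature
            nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tgt = torch.cat([tgt, nxt], dim=1)
        return tgt  # discrete codes; a separate (VQ-style) decoder maps them to pixels
```

Training would minimize the cross-entropy between `logits[:, :-1]` and the observed target codes; calling `sample` repeatedly with the same source image and camera motion then yields different yet plausible target views, which is how such a model can cover the full distribution of realizations mentioned above.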
Related papers
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs).
Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry.
This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
- Multiple View Geometry Transformers for 3D Human Pose Estimation [35.26756920323391]
We aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation.
We propose a novel hybrid model, MVGFormer, which has a series of geometric and appearance modules organized in an iterative manner.
arXiv Detail & Related papers (2023-11-18T06:32:40Z)
- SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction [26.165314261806603]
We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation.
Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- Novel View Synthesis with Diffusion Models [56.55571338854636]
We present 3DiM, a diffusion model for 3D novel view synthesis.
It is able to translate a single input view into consistent and sharp completions across many views.
3DiM can generate multiple views that are 3D consistent using a novel technique called stochastic conditioning.
arXiv Detail & Related papers (2022-10-06T16:59:56Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering (a minimal sketch of this kind of conditioned volume rendering appears after this list).
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images [94.49117671450531]
State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis.
In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations.
arXiv Detail & Related papers (2022-03-29T22:03:18Z)
- PixelSynth: Generating a 3D-Consistent Experience from a Single Image [30.64117903216323]
We present an approach that fuses 3D reasoning with autoregressive modeling to outpaint large view changes in a 3D-consistent manner.
We demonstrate considerable improvement in single image large-angle view synthesis results compared to a variety of methods and possible variants.
arXiv Detail & Related papers (2021-08-12T17:59:31Z)
- AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation [27.163052958878776]
This paper targets learning-based novel view synthesis from a single or a limited number of 2D images without pose supervision.
We construct an end-to-end trainable conditional variational framework to disentangle the unsupervisedly learned relative pose/rotation and the implicit global 3D representation.
Our system achieves implicit 3D understanding without explicit 3D reconstruction.
arXiv Detail & Related papers (2020-07-13T18:51:27Z)
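For the NeRF-style entries above (in particular the Vision Transformer one), here is a minimal, hedged sketch of what an MLP conditioned on a learned 3D representation performing volume rendering can look like: a conditioning vector stands in for features extracted from a single input image, the MLP predicts color and density at points sampled along each camera ray, and standard alpha compositing integrates them into pixel colors. The network layout, feature dimension, and uniform sampling scheme are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch: volume rendering with an MLP conditioned on a latent feature
# vector that stands in for a learned 3D representation of a single input image.
import torch
import torch.nn as nn

class ConditionedRadianceMLP(nn.Module):
    def __init__(self, cond_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # (r, g, b, density) per 3D point
        )

    def forward(self, points, cond):
        # points: (R, S, 3) sample locations along R rays; cond: (cond_dim,) latent
        c = cond.expand(*points.shape[:-1], -1)
        out = self.net(torch.cat([points, c], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3])   # rgb, sigma

def render_rays(mlp, origins, directions, cond, near=0.5, far=2.0, n_samples=64):
    # Uniform samples along each ray, then alpha compositing (volume rendering).
    t = torch.linspace(near, far, n_samples)                                   # (S,)
    points = origins[:, None, :] + t[None, :, None] * directions[:, None, :]   # (R, S, 3)
    rgb, sigma = mlp(points, cond)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma * delta)                                    # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                                        # transmittance
    weights = alpha * trans                                                    # (R, S)
    return (weights[..., None] * rgb).sum(dim=1)                               # (R, 3) colors
```

Calling `render_rays` on rays cast from a novel camera pose produces the novel view one pixel per ray; the single-image aspect comes entirely from `cond`, which such methods predict with an image encoder.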
This list is automatically generated from the titles and abstracts of the papers on this site.