VaLID: Variable-Length Input Diffusion for Novel View Synthesis
- URL: http://arxiv.org/abs/2312.08892v1
- Date: Thu, 14 Dec 2023 12:52:53 GMT
- Title: VaLID: Variable-Length Input Diffusion for Novel View Synthesis
- Authors: Shijie Li, Farhad G. Zanjani, Haitam Ben Yahia, Yuki M. Asano, Juergen
Gall, Amirhossein Habibian
- Abstract summary: Novel View Synthesis (NVS), which aims to produce a realistic image at the target view given source-view images and their corresponding poses, is a fundamental problem in 3D Vision.
We process each pose-image pair separately and then fuse them into a unified visual representation that is injected into the model.
The proposed Multi-view Cross Former module maps variable-length input data to fixed-size output data.
- Score: 36.57742242154048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Novel View Synthesis (NVS), which aims to produce a realistic image at
the target view given source-view images and their corresponding poses, is a
fundamental problem in 3D Vision. As this task is heavily under-constrained,
some recent work, such as Zero123, tackles it with generative modeling,
specifically with pre-trained diffusion models. Although this strategy
generalizes well to new scenes compared to neural radiance field-based methods,
it offers limited flexibility. For example, it can only accept a single-view
image as input, even though realistic applications often provide multiple input
images. This is because the source-view image and its corresponding pose are
processed separately and injected into the model at different stages, so it is
not trivial to generalize the model to multi-view source images once they are
available. To solve this issue, we process each pose-image pair separately and
then fuse them into a unified visual representation that is injected into the
model to guide image synthesis at the target views. However, inconsistency and
computation cost increase with the number of input source-view images. To
address these issues, we propose the Multi-view Cross Former module, which maps
variable-length input data to fixed-size output data. A two-stage training
strategy is introduced to further improve training efficiency. Qualitative and
quantitative evaluation on multiple datasets demonstrates the effectiveness of
the proposed method over previous approaches. The code will be released upon
acceptance.
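The fusion step described in the abstract, mapping a variable number of pose-image pairs to a fixed-size conditioning signal, can be pictured as cross-attention from a fixed set of learnable query tokens to the concatenated per-view features. The sketch below is a minimal illustration of that idea only; the class name, dimensions, token counts, and layer layout are assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a "Multi-view Cross Former"-style fusion module (assumed design):
# a fixed set of learnable queries cross-attends to per-view feature tokens, so any
# number of source views is reduced to a fixed-size conditioning representation.
import torch
import torch.nn as nn


class MultiViewCrossFormer(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 77, num_heads: int = 8):
        super().__init__()
        # Fixed number of output tokens, independent of the number of input views.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (batch, num_views * tokens_per_view, dim), already encoding
        # each (pose, image) pair; the sequence length varies with the view count.
        b = view_tokens.shape[0]
        q = self.norm_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(view_tokens)
        fused, _ = self.attn(q, kv, kv)          # queries attend to all view tokens
        fused = fused + self.ffn(fused)          # simple feed-forward refinement
        return fused  # (batch, num_queries, dim): fixed-size conditioning signal


# Usage: 1, 2, or 5 source views all yield the same-shaped output.
module = MultiViewCrossFormer()
for num_views in (1, 2, 5):
    tokens = torch.randn(4, num_views * 257, 768)  # e.g. 257 tokens per view
    print(module(tokens).shape)                    # torch.Size([4, 77, 768]) each time
```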
Related papers
- Efficient-3DiM: Learning a Generalizable Single-image Novel-view
Synthesizer in One Day [63.96075838322437]
We propose a framework to learn a single-image novel-view synthesizer.
Our framework is able to reduce the total training time from 10 days to less than 1 day.
arXiv Detail & Related papers (2023-10-04T17:57:07Z)
- The Change You Want to See (Now in 3D) [65.61789642291636]
The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene.
We contribute a change detection model that is trained entirely on synthetic data and is class-agnostic.
We release a new evaluation dataset consisting of real-world image pairs with human-annotated differences.
arXiv Detail & Related papers (2023-08-21T01:59:45Z)
- Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens [34.50928515515274]
We present a novel method to estimate 3D human pose and shape from monocular videos.
The proposed method attains superior performances on the 3DPW and Human3.6M datasets.
arXiv Detail & Related papers (2023-03-01T07:48:01Z)
- Novel View Synthesis with Diffusion Models [56.55571338854636]
We present 3DiM, a diffusion model for 3D novel view synthesis.
It is able to translate a single input view into consistent and sharp completions across many views.
3DiM can generate multiple views that are 3D consistent using a novel technique called stochastic conditioning.
arXiv Detail & Related papers (2022-10-06T16:59:56Z)
- im2nerf: Image to Neural Radiance Field in the Wild [47.18702901448768]
im2nerf is a learning framework that predicts a continuous neural object representation given a single input image in the wild.
We show that im2nerf achieves the state-of-the-art performance for novel view synthesis from a single-view unposed image in the wild.
arXiv Detail & Related papers (2022-09-08T23:28:56Z)
- Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering.
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
- Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on denoising diffusion probabilistic models to generate images under multimodal priors.
arXiv Detail & Related papers (2022-06-10T12:23:05Z)
- ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers [34.4824364161812]
Novel view synthesis is a problem where we are given only a few context views sparsely covering a scene or an object.
The goal is to predict novel viewpoints in the scene, which requires learning priors.
We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network.
arXiv Detail & Related papers (2022-03-18T21:08:23Z)
- Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework where a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)