NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
- URL: http://arxiv.org/abs/2412.03517v2
- Date: Fri, 06 Dec 2024 13:56:50 GMT
- Title: NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
- Authors: Lingen Li, Zhaoyang Zhang, Yaowei Li, Jiale Xu, Wenbo Hu, Xiaoyu Li, Weihao Cheng, Jinwei Gu, Tianfan Xue, Ying Shan
- Abstract summary: NVComposer is a novel approach that eliminates the need for explicit external alignment.
NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks.
Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases.
- Score: 50.36605863731669
- Abstract: Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems. Our project page is available at https://lg-li.github.io/project/nvcomposer
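To make the two components concrete, here is a minimal PyTorch-style sketch of the image-pose dual-stream idea. All names here, `DualStreamDenoiser` included, are hypothetical: NVComposer's actual backbone is a diffusion model over image latents, and this only mirrors the high-level design of one network that jointly predicts target-view noise and condition-view poses.

```python
# Hypothetical sketch of an image-pose dual-stream denoiser (illustrative only).
# One trunk consumes the noisy target latent plus unposed condition-view
# features; two heads then predict (a) the injected noise and (b) a camera
# pose per condition view, so poses are inferred rather than supplied.
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    def __init__(self, latent_dim=64, hidden=256, n_cond_views=4, pose_dim=12):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(latent_dim * (1 + n_cond_views), hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
        )
        self.image_head = nn.Linear(hidden, latent_dim)              # denoising stream
        self.pose_head = nn.Linear(hidden, n_cond_views * pose_dim)  # pose stream
        self.n_cond_views, self.pose_dim = n_cond_views, pose_dim

    def forward(self, noisy_target, cond_latents):
        # noisy_target: (B, latent_dim); cond_latents: (B, n_cond_views, latent_dim)
        h = self.trunk(torch.cat([noisy_target, cond_latents.flatten(1)], dim=1))
        eps_pred = self.image_head(h)
        poses = self.pose_head(h).view(-1, self.n_cond_views, self.pose_dim)
        return eps_pred, poses
```

The second component, geometry-aware feature alignment, would appear only at training time: a distillation loss pulling intermediate features toward those of a dense stereo model, so no external alignment is needed at inference.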
Related papers
- Novel View Synthesis with Pixel-Space Diffusion Models [4.844800099745365]
Generative models are increasingly employed in novel view synthesis (NVS).
We adapt a modern diffusion model architecture for end-to-end NVS in the pixel space.
We introduce a novel NVS training scheme that utilizes single-view datasets, capitalizing on their relative abundance.
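As a rough illustration of what end-to-end pixel-space diffusion training for NVS looks like, the sketch below uses the standard epsilon-prediction loss; the `denoiser` signature and its conditioning inputs are assumptions, not the paper's interface.

```python
# Illustrative pixel-space NVS diffusion training step (not the paper's code).
# The denoiser sees the noisy target view, the clean source view, and the
# relative camera pose, and regresses the injected noise directly on pixels.
import torch
import torch.nn.functional as F

def nvs_diffusion_loss(denoiser, source, target, rel_pose, timesteps, alphas_cumprod):
    noise = torch.randn_like(target)
    a = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    noisy_target = a.sqrt() * target + (1 - a).sqrt() * noise  # forward process
    eps_pred = denoiser(noisy_target, source, rel_pose, timesteps)
    return F.mse_loss(eps_pred, noise)                         # standard epsilon loss
```

The single-view training scheme the summary mentions would plug into the same loss; how the source view is formed in that case is not specified here.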
arXiv Detail & Related papers (2024-11-12T12:58:33Z)
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
In the first stage, we employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
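The two-stage structure reads naturally as a short pipeline. `view_generator`, `quality_scorer`, and `flexrm` below stand in for the components the summary names (candidate view generation, view curation, and the Flexible Reconstruction Model); their interfaces are guessed for illustration.

```python
# Hypothetical skeleton of Flex3D's two stages (interfaces are assumptions).
def flex3d_pipeline(prompt_or_image, view_generator, quality_scorer, flexrm,
                    n_candidates=16, keep_threshold=0.5):
    # Stage 1: generate a pool of candidate views, then curate by quality.
    candidates = [view_generator(prompt_or_image, seed=i) for i in range(n_candidates)]
    curated = [v for v in candidates if quality_scorer(v) >= keep_threshold]
    # Stage 2: FlexRM is transformer-based, so it can consume an arbitrary
    # number of curated views when regressing the 3D representation.
    return flexrm(curated)
```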
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer [48.57740681957145]
We propose a new novel view synthesis (NVS) paradigm that operates without the need for training.
NVS-Solver adaptively modulates the diffusion sampling process with the given views to enable the creation of remarkable visual experiences.
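A hedged sketch of the training-free idea, assuming a diffusers-style scheduler: the frozen model denoises as usual, and each intermediate estimate is blended toward the given views warped into the target camera. `warp_to_target` and the constant blend weight are illustrative stand-ins for the paper's adaptive modulation.

```python
# Zero-shot guided sampling in the spirit of NVS-Solver (illustrative).
import torch

@torch.no_grad()
def guided_sampling(model, scheduler, given_views, warp_to_target, guide_w=0.3):
    observed = warp_to_target(given_views)    # given views in the target camera
    x = torch.randn_like(observed)            # start from pure noise
    for t in scheduler.timesteps:
        x = scheduler.step(model(x, t), t, x).prev_sample  # ordinary denoising
        x = (1 - guide_w) * x + guide_w * observed         # pull toward observations
    return x
```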
arXiv Detail & Related papers (2024-05-24T08:56:19Z)
- PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis [23.967904337714234]
We propose a set-based generative model that can simultaneously generate multiple, self-consistent new views.
Our approach is not limited to generating a single image at a time and can condition on a variable number of views.
We show that the model is capable of generating sets of views that have no natural ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.
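To illustrate what "set-based" means here, the sketch below treats known and target views as tokens in one unordered set and lets self-attention tie them together, so any number of conditioning views and any generation order work. The module and its masking are assumptions, not PolyOculus's architecture.

```python
# Minimal sketch of set-based multi-view generation (illustrative).
import torch
import torch.nn as nn

class SetViewDenoiser(nn.Module):
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, view_tokens, known_mask):
        # view_tokens: (B, n_views, dim) mixing clean (known) and noisy (target) views.
        # known_mask:  (B, n_views) bool, True for conditioning views.
        h = self.encoder(view_tokens)   # every view attends to every other view
        eps = self.out(h)
        # Only unknown views are denoised; known views pass through unchanged.
        return torch.where(known_mask.unsqueeze(-1), view_tokens, eps)
```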
arXiv Detail & Related papers (2024-02-28T02:06:11Z)
- NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image [45.34977005820166]
NVS-Adapter is a plug-and-play module for a Text-to-Image (T2I) model.
It synthesizes novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models.
Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views.
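A plug-and-play module of this kind is typically a small residual cross-attention block dropped into the frozen T2I backbone; the sketch below is a generic guess at that shape rather than NVS-Adapter's actual design.

```python
# Generic adapter shape: frozen T2I features gain cross-view attention.
import torch.nn as nn

class CrossViewAdapter(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_feats, reference_feats):
        # Residual cross-attention: target-view tokens query the reference view.
        attended, _ = self.attn(self.norm(target_feats), reference_feats, reference_feats)
        return target_feats + attended  # only the adapter is trained; T2I stays frozen
```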
arXiv Detail & Related papers (2023-12-12T14:29:57Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, striking a balance between fast inference and parallel training while maintaining competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
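The dual formulation can be shown on a toy, single-head retention layer: the parallel form computes all positions at once for training, while the mathematically equivalent recurrent form carries a small state for constant-cost-per-token inference. Projections, multi-head structure, and normalization are omitted; `gamma` is an illustrative decay.

```python
# Toy retention layer in its two equivalent forms (illustrative).
import torch

def retention_parallel(q, k, v, gamma=0.9):
    # q, k, v: (T, d). Decay matrix D[t, s] = gamma**(t - s) for s <= t, else 0.
    T = q.shape[0]
    idx = torch.arange(T, dtype=torch.float32)
    D = (gamma ** (idx[:, None] - idx[None, :])) * (idx[:, None] >= idx[None, :])
    return (q @ k.T * D) @ v

def retention_recurrent(q, k, v, gamma=0.9):
    # Sequential form: the state carries a decayed sum of outer products k^T v.
    d = q.shape[1]
    state = torch.zeros(d, d)
    outs = []
    for t in range(q.shape[0]):
        state = gamma * state + k[t:t+1].T @ v[t:t+1]
        outs.append(q[t:t+1] @ state)
    return torch.cat(outs)
```

Running both on the same random `q, k, v` and comparing with `torch.allclose` confirms the two forms agree.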
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models [16.326276673056334]
Consistent-1-to-3 is a generative framework that significantly mitigates the cross-view inconsistency issue in NVS.
We decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions.
We propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information.
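As an illustration of epipolar-guided attention, the sketch below builds a binary mask that keeps, for each target pixel, only the source pixels lying near its epipolar line; the fundamental matrix `F_mat` and the pixel threshold are assumed inputs, not the paper's formulation.

```python
# Illustrative epipolar attention mask between a target and a source view.
import torch

def epipolar_attention_mask(target_xy, source_xy, F_mat, threshold=2.0):
    # target_xy: (N, 2), source_xy: (M, 2) pixel coordinates; F_mat: (3, 3).
    ones_t = torch.ones(target_xy.shape[0], 1)
    lines = torch.cat([target_xy, ones_t], dim=1) @ F_mat.T  # (N, 3) lines a,b,c
    ones_s = torch.ones(source_xy.shape[0], 1)
    src_h = torch.cat([source_xy, ones_s], dim=1)            # (M, 3) homogeneous points
    # Point-to-line distance |ax + by + c| / sqrt(a^2 + b^2) for every pair.
    num = (lines @ src_h.T).abs()                            # (N, M)
    denom = lines[:, :2].norm(dim=1, keepdim=True)
    return (num / denom) < threshold                         # True where attention is allowed
```

Such a mask would then restrict an attention layer so each target pixel aggregates only geometrically plausible source pixels.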
arXiv Detail & Related papers (2023-10-04T17:58:57Z)
- Self-Supervised Visibility Learning for Novel View Synthesis [79.53158728483375]
Conventional rendering methods estimate scene geometry and synthesize novel views in two separate steps.
We propose an end-to-end NVS framework that eliminates the error propagation between these steps.
Our network is trained in an end-to-end self-supervised fashion, thus significantly alleviating error accumulation in view synthesis.
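The end-to-end idea can be caricatured as learning per-pixel visibility and letting it drive the blend of warped source views, so occlusion handling is trained jointly with synthesis rather than in a separate geometry step. The function below is a hedged sketch, not the paper's network; the visibility logits are assumed to come from a learned module.

```python
# Visibility-weighted blending of source views warped into the target camera.
import torch

def blend_warped_views(warped, visibility_logits):
    # warped: (n_views, 3, H, W); visibility_logits: (n_views, 1, H, W)
    weights = torch.softmax(visibility_logits, dim=0)  # normalize across views per pixel
    return (weights * warped).sum(dim=0)               # (3, H, W) synthesized target view
```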
arXiv Detail & Related papers (2021-03-29T08:11:25Z)