NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
- URL: http://arxiv.org/abs/2312.07315v2
- Date: Sat, 10 Aug 2024 07:07:35 GMT
- Title: NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
- Authors: Yoonwoo Jeong, Jinwoo Lee, Chiheon Kim, Minsu Cho, Doyup Lee
- Abstract summary: NVS-Adapter is a plug-and-play module for a Text-to-Image (T2I) model.
It synthesizes novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models.
Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views.
- Score: 45.34977005820166
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transfer learning of large-scale Text-to-Image (T2I) models has recently shown impressive potential for Novel View Synthesis (NVS) of diverse objects from a single image. While previous methods typically train large models on multi-view datasets for NVS, fine-tuning all parameters of a T2I model not only incurs a high cost but also reduces the model's capacity to generate diverse images in a new domain. In this study, we propose an effective method, dubbed NVS-Adapter, a plug-and-play module for a T2I model that synthesizes novel multi-views of visual objects while fully exploiting the generalization capacity of the T2I model. NVS-Adapter consists of two main components: view-consistency cross-attention learns visual correspondences to align the local details of view features, and global semantic conditioning aligns the semantic structure of generated views with the reference view. Experimental results demonstrate that NVS-Adapter effectively synthesizes geometrically consistent multi-views and achieves high performance on benchmarks without full fine-tuning of T2I models. The code and data are publicly available at https://postech-cvlab.github.io/nvsadapter/.
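The abstract describes the two adapter components only at a high level. Below is a minimal sketch of how a plug-and-play cross-view adapter of this kind could look, assuming a PyTorch-style implementation on top of a frozen T2I backbone; the class, argument, and tensor names are illustrative assumptions and not the authors' released code or API.

```python
# Illustrative sketch only: queries from the target view attend to reference-view
# tokens (view-consistency cross-attention), and a global semantic embedding of
# the reference view is injected as an extra conditioning token.
import torch
import torch.nn as nn

class ViewConsistencyAdapter(nn.Module):
    def __init__(self, d_model: int = 320, n_heads: int = 8, d_semantic: int = 768):
        super().__init__()
        # Cross-attention across views: target-view queries, reference-view keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Global semantic conditioning: project a reference-view embedding
        # (e.g. a CLIP image embedding) into the feature space as one token.
        self.sem_proj = nn.Linear(d_semantic, d_model)

    def forward(self, target_feats, ref_feats, ref_semantic):
        # target_feats: (B, N_t, d_model) tokens of the view being generated
        # ref_feats:    (B, N_r, d_model) tokens of the reference view
        # ref_semantic: (B, d_semantic)   global embedding of the reference view
        sem_token = self.sem_proj(ref_semantic).unsqueeze(1)   # (B, 1, d_model)
        context = torch.cat([ref_feats, sem_token], dim=1)     # (B, N_r + 1, d_model)
        attended, _ = self.cross_attn(self.norm(target_feats), context, context)
        # Residual connection: the frozen T2I feature path is left unchanged,
        # so the module can be bolted on without retraining the backbone.
        return target_feats + attended

# Usage: only the adapter's parameters would be trained; the T2I model stays frozen.
adapter = ViewConsistencyAdapter()
tgt = torch.randn(2, 64, 320)
ref = torch.randn(2, 64, 320)
sem = torch.randn(2, 768)
out = adapter(tgt, ref, sem)  # (2, 64, 320)
```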
Related papers
- NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images [50.36605863731669]
NVComposer is a novel approach that eliminates the need for explicit external alignment.
NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks.
Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases.
arXiv Detail & Related papers (2024-12-04T17:58:03Z) - MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model [87.71060849866093]
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks.
Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses.
We present several training and model modifications to strengthen the model with scaled-up datasets.
arXiv Detail & Related papers (2024-11-25T07:34:23Z) - TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation [22.782099757385804]
TIP-I2V is the first large-scale dataset of user-provided text and image prompts for image-to-video generation.
We provide the corresponding generated videos from five state-of-the-art image-to-video models.
arXiv Detail & Related papers (2024-11-05T18:52:43Z) - SingleInsert: Inserting New Concepts from a Single Image into
Text-to-Image Models for Flexible Editing [59.3017821001455]
SingleInsert is an image-to-text (I2T) inversion method that operates on a single source image containing the target concept.
In this work, we propose a simple and effective baseline for single-image I2T inversion, named SingleInsert.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
arXiv Detail & Related papers (2023-10-12T07:40:39Z) - MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval [60.454321238910474]
State-of-the-art video-text retrieval methods typically involve fully fine-tuning a pre-trained model on specific datasets.
We present a pioneering approach that enables parameter-efficient video-text retrieval (VTR) using a pre-trained model.
We propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text.
arXiv Detail & Related papers (2023-01-19T03:42:56Z) - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for T2V training.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z) - Multi-Scale Vision Longformer: A New Vision Transformer for
High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z) - Deep View Synthesis via Self-Consistent Generative Network [41.34461086700849]
View synthesis aims to produce unseen views from a set of views captured by two or more cameras at different positions.
To address this problem, most existing methods exploit geometric information to match pixels across views.
We propose a novel deep generative model, called Self-Consistent Generative Network (SCGN), which synthesizes novel views without explicitly exploiting the geometric information.
arXiv Detail & Related papers (2021-01-19T10:56:00Z)