Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering
- URL: http://arxiv.org/abs/2405.18677v2
- Date: Thu, 24 Oct 2024 12:51:54 GMT
- Title: Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering
- Authors: Ido Sobol, Chenfeng Xu, Or Litany
- Abstract summary: We propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps.
We modify the self-attention mechanism to integrate information from the source view, reducing shape distortions.
Results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects.
- Abstract: Generating realistic images from arbitrary views based on a single source image remains a significant challenge in computer vision, with broad applications ranging from e-commerce to immersive virtual experiences. Recent advancements in diffusion models, particularly the Zero-1-to-3 model, have been widely adopted for generating plausible views, videos, and 3D models. However, these models still struggle with inconsistencies and implausibility in novel view generation, especially for challenging changes in viewpoint. In this work, we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity. This process improves geometric consistency without requiring retraining or significant computational resources. Additionally, we modify the self-attention mechanism to integrate information from the source view, reducing shape distortions. These processes are further supported by a specialized sampling schedule. Experimental results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects. Finally, we demonstrate the general applicability and effectiveness of Zero-to-Hero in multi-view generation and in image generation conditioned on semantic maps and pose.
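To make the abstract's two mechanisms concrete, here is a minimal PyTorch-style sketch. It is illustrative only: the function names, tensor shapes, and the exponential-moving-average aggregation rule are assumptions made for exposition; the abstract states only that attention maps are aggregated during denoising (by analogy with gradient aggregation in SGD) and that source-view information is integrated into self-attention.

```python
# A minimal sketch of the two test-time ideas described in the abstract.
# All names and the EMA rule are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F


def filter_attention_maps(attn_history, decay=0.8):
    """Aggregate attention maps across denoising steps.

    Analogous to momentum in SGD: rather than trusting the attention map of a
    single (noisy) step, keep an exponential moving average over steps so that
    spurious, step-specific attention patterns are smoothed out.
    attn_history: list of [heads, n_queries, n_keys] tensors from successive
    denoising iterations.
    """
    filtered = attn_history[0]
    for attn in attn_history[1:]:
        filtered = decay * filtered + (1.0 - decay) * attn
    # Renormalize so each query row is a valid distribution again.
    return filtered / filtered.sum(dim=-1, keepdim=True)


def source_aware_self_attention(q_tgt, k_tgt, v_tgt, k_src, v_src):
    """Target-view self-attention extended with source-view keys/values,
    letting the target view copy appearance and shape cues directly from the
    input image (the abstract's shape-distortion fix)."""
    scale = q_tgt.shape[-1] ** -0.5
    k = torch.cat([k_tgt, k_src], dim=-2)  # [*, n_tgt + n_src, d]
    v = torch.cat([v_tgt, v_src], dim=-2)
    attn = F.softmax(q_tgt @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```

In an actual pipeline, the filtered maps would be written back into the UNet's attention layers at each step, and the key/value concatenation mirrors the common reference-attention pattern used in diffusion-based editing methods.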
Related papers
- NovelGS: Consistent Novel-view Denoising via Large Gaussian Reconstruction Model [57.92709692193132]
NovelGS is a diffusion model for Gaussian Splatting given sparse-view images.
It leverages novel-view denoising through a transformer-based network to generate 3D Gaussians.
arXiv Detail & Related papers (2024-11-25T07:57:17Z)
- Fixing the Perspective: A Critical Examination of Zero-1-to-3 [0.0]
We investigate Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet.
We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously.
arXiv Detail & Related papers (2024-11-24T04:21:51Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- Learning Robust Generalizable Radiance Field with Visibility and Feature Augmented Point Representation [7.203073346844801]
This paper introduces a novel paradigm for generalizable neural radiance fields (NeRF).
We propose the first paradigm that constructs the generalizable neural field based on point-based rather than image-based rendering.
Our approach explicitly models visibilities by geometric priors and augments them with neural features.
arXiv Detail & Related papers (2024-01-25T17:58:51Z)
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis [18.64688172651478]
We present DiffPortrait3D, a conditional diffusion model capable of synthesizing 3D-consistent photo-realistic novel views.
Given a single RGB input, we aim to synthesize plausible yet consistent facial details rendered from novel camera views.
We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
arXiv Detail & Related papers (2023-12-20T13:31:11Z)
- Consistent123: Improve Consistency for One Image to 3D Object Synthesis [74.1094516222327]
Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability.
However, these models have no guarantee of view consistency, limiting their performance on downstream tasks like 3D reconstruction and image-to-3D generation.
We propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism.
arXiv Detail & Related papers (2023-10-12T07:38:28Z)
- Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models [16.326276673056334]
Consistent-1-to-3 is a generative framework that significantly mitigates the view-inconsistency issue.
We decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions.
We propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information (a generic illustrative sketch of epipolar-guided attention appears after this list).
arXiv Detail & Related papers (2023-10-04T17:58:57Z)
- Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day [63.96075838322437]
We propose a framework to learn a single-image novel-view synthesizer.
Our framework is able to reduce the total training time from 10 days to less than 1 day.
arXiv Detail & Related papers (2023-10-04T17:57:07Z)
- Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction [60.52716381465063]
We introduce Deceptive-NeRF/3DGS to enhance sparse-view reconstruction with only a limited set of input images.
Specifically, we propose a deceptive diffusion model that turns noisy images rendered from few-view reconstructions into high-quality pseudo-observations.
Our system progressively incorporates diffusion-generated pseudo-observations into the training image sets, ultimately densifying the sparse input observations by 5 to 10 times.
arXiv Detail & Related papers (2023-05-24T14:00:32Z)
- GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images [79.39247661907397]
We introduce an effective framework, Generalizable Model-based Neural Radiance Fields (GM-NeRF), to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z)
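The epipolar-guided attention mentioned in the Consistent-1-to-3 entry above can be illustrated with a generic, hedged sketch (not that paper's implementation; `F_mat`, the coordinate layout, and `threshold` are assumptions): attention from each target pixel is restricted to source pixels lying near its epipolar line.

```python
import torch


def epipolar_attention_mask(F_mat, tgt_pts, src_pts, threshold=2.0):
    """Boolean mask restricting cross-view attention to epipolar neighborhoods.

    F_mat:   [3, 3] fundamental matrix mapping target pixels to epipolar
             lines in the source view (x_src^T @ F_mat @ x_tgt = 0).
    tgt_pts: [n_tgt, 3] homogeneous pixel coordinates in the target view.
    src_pts: [n_src, 3] homogeneous pixel coordinates in the source view.
    """
    lines = tgt_pts @ F_mat.T                # [n_tgt, 3], one epipolar line per target pixel
    dist = (lines @ src_pts.T).abs()         # |a*u + b*v + c| for every pixel pair
    dist = dist / lines[:, :2].norm(dim=-1, keepdim=True)  # normalize to pixel distance
    return dist < threshold                  # [n_tgt, n_src] attention mask
```

The resulting boolean mask would typically be applied by setting masked-out attention logits to -inf before the softmax, a standard way to impose hard geometric constraints on cross-view attention.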
This list is automatically generated from the titles and abstracts of the papers on this site.