Multi-View Unsupervised Image Generation with Cross Attention Guidance
- URL: http://arxiv.org/abs/2312.04337v1
- Date: Thu, 7 Dec 2023 14:55:13 GMT
- Title: Multi-View Unsupervised Image Generation with Cross Attention Guidance
- Authors: Llukman Cerkezi, Aram Davtyan, Sepehr Sameni, Paolo Favaro
- Abstract summary: This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset, comparing the visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
- Score: 23.07929124170851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing interest in novel view synthesis, driven by Neural Radiance Field
(NeRF) models, is hindered by scalability issues due to their reliance on
precisely annotated multi-view images. Recent models address this by
fine-tuning large text2image diffusion models on synthetic multi-view data.
Despite robust zero-shot generalization, they may need post-processing and can
face quality issues due to the synthetic-real domain gap. This paper introduces
a novel pipeline for unsupervised training of a pose-conditioned diffusion
model on single-category datasets. With the help of pretrained self-supervised
Vision Transformers (DINOv2), we identify object poses by clustering the
dataset, comparing the visibility and locations of specific object parts.
The pose-conditioned diffusion model, trained on these pose labels and
equipped with cross-frame attention at inference time, ensures cross-view
consistency, which is further strengthened by our novel hard-attention
guidance. Our model, MIRAGE,
surpasses prior work in novel view synthesis on real images. Furthermore,
MIRAGE is robust to diverse textures and geometries, as demonstrated with our
experiments on synthetic images generated with pretrained Stable Diffusion.
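Code sketch (illustrative): the abstract describes pose discovery only at a high level, so below is a minimal sketch of one way to realize it with DINOv2 patch features and k-means. The helper names (patch_features, pose_descriptor, pose_labels), the descriptor layout, and the number of parts/poses are assumptions for illustration, not the authors' code; a real pipeline would also separate foreground patches from background before clustering.

```python
# Illustrative sketch only: cluster images into discrete pose labels by
# comparing which object "parts" are visible and where they appear,
# using patch features from a pretrained DINOv2 backbone.
import torch
from sklearn.cluster import KMeans

# Pretrained self-supervised ViT (DINOv2 ViT-S/14) from torch.hub.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def patch_features(img):
    """img: (3, H, W) normalized tensor, H and W multiples of 14.
    Returns (N, D) patch tokens laid out on a square grid."""
    return dino.forward_features(img.unsqueeze(0))["x_norm_patchtokens"][0]

def pose_descriptor(img, part_centroids):
    """Encode one image as [visible?, mean_row, mean_col] per part.
    part_centroids: (K, D) k-means centers computed beforehand over
    patch features pooled from the whole dataset (the "parts")."""
    feats = patch_features(img)                                  # (N, D)
    n, side = feats.shape[0], int(feats.shape[0] ** 0.5)
    assign = torch.cdist(feats, part_centroids).argmin(dim=1)    # (N,)
    rows, cols = torch.arange(n) // side, torch.arange(n) % side
    desc = torch.zeros(part_centroids.shape[0], 3)
    for k in range(part_centroids.shape[0]):
        mask = assign == k
        if mask.any():                                           # part k visible
            desc[k] = torch.tensor([1.0,
                                    rows[mask].float().mean() / side,
                                    cols[mask].float().mean() / side])
    return desc.flatten()                                        # (3K,)

def pose_labels(descriptors, n_poses=8):
    """descriptors: (num_images, 3K) array of stacked pose descriptors.
    The returned cluster ids serve as discrete pose labels for training."""
    return KMeans(n_clusters=n_poses, n_init=10).fit_predict(descriptors)
```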
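Similarly, a rough sketch of cross-frame attention at sampling time: the view being generated computes queries as usual, but its keys and values come from a shared anchor view, which ties appearance across views. The hard_attention variant is only one plausible reading of "hard-attention guidance" (each query snaps to its single best anchor key); the paper's exact formulation may differ.

```python
# Illustrative sketch only: cross-frame attention replaces self-attention
# inside the denoising network at inference, with k/v from an anchor view.
import torch
import torch.nn.functional as F

def split_heads(x, n_heads):
    B, L, D = x.shape
    return x.view(B, L, n_heads, D // n_heads).transpose(1, 2)    # (B, h, L, d)

def cross_frame_attention(q, k_anchor, v_anchor, n_heads=8):
    """q from the current view, k/v from the anchor view; all (B, L, D)."""
    q, k, v = (split_heads(t, n_heads) for t in (q, k_anchor, v_anchor))
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, h, L, L)
    out = attn @ v                                                # (B, h, L, d)
    B, _, L, _ = out.shape
    return out.transpose(1, 2).reshape(B, L, n_heads * d)

def hard_attention(q, k_anchor, v_anchor, n_heads=8):
    """Assumption: a 'hard' variant where each query copies the value of its
    single best-matching anchor key instead of a softmax-weighted mix."""
    q, k, v = (split_heads(t, n_heads) for t in (q, k_anchor, v_anchor))
    B, _, L, d = q.shape
    idx = (q @ k.transpose(-2, -1)).argmax(dim=-1)                # (B, h, L)
    out = torch.gather(v, 2, idx.unsqueeze(-1).expand(-1, -1, -1, d))
    return out.transpose(1, 2).reshape(B, L, n_heads * d)
```

In practice these would be patched into the self-attention layers of the pretrained denoiser during sampling, with the anchor view's tokens cached from its own denoising pass.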
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z) - Improving Few-shot Image Generation by Structural Discrimination and
Textural Modulation [10.389698647141296]
Few-shot image generation aims to produce plausible and diverse images for one category given a few images from this category.
Existing approaches either globally interpolate different images or fuse local representations with pre-defined coefficients.
This paper proposes a novel mechanism to inject external semantic signals into internal local representations.
arXiv Detail & Related papers (2023-08-30T16:10:21Z) - Consistent View Synthesis with Pose-Guided Diffusion Models [51.37925069307313]
Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications.
We propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image.
arXiv Detail & Related papers (2023-03-30T17:59:22Z) - GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from
Multi-view Images [79.39247661907397]
We introduce an effective framework, Generalizable Model-based Neural Radiance Fields (GM-NeRF), to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z) - Object-Centric Slot Diffusion [30.722428924152382]
We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes.
We demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders.
We also conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD.
arXiv Detail & Related papers (2023-03-20T02:40:16Z) - Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z) - Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models [33.69732363040526]
We propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images.
This is the first work to successfully leverage diffusion models for coherent visual story synthesis.
arXiv Detail & Related papers (2022-11-20T11:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.