Multi-View Unsupervised Image Generation with Cross Attention Guidance
- URL: http://arxiv.org/abs/2312.04337v1
- Date: Thu, 7 Dec 2023 14:55:13 GMT
- Title: Multi-View Unsupervised Image Generation with Cross Attention Guidance
- Authors: Llukman Cerkezi, Aram Davtyan, Sepehr Sameni, Paolo Favaro
- Abstract summary: This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing interest in novel view synthesis, driven by Neural Radiance Field
(NeRF) models, is hindered by scalability issues due to their reliance on
precisely annotated multi-view images. Recent models address this by
fine-tuning large text2image diffusion models on synthetic multi-view data.
Despite robust zero-shot generalization, they may need post-processing and can
face quality issues due to the synthetic-real domain gap. This paper introduces
a novel pipeline for unsupervised training of a pose-conditioned diffusion
model on single-category datasets. With the help of pretrained self-supervised
Vision Transformers (DINOv2), we identify object poses by clustering the
dataset through comparing visibility and locations of specific object parts.
The pose-conditioned diffusion model, trained on these pose labels and equipped
with cross-frame attention at inference time, ensures cross-view consistency,
which is further aided by our novel hard-attention guidance. Our model, MIRAGE,
surpasses prior work in novel view synthesis on real images. Furthermore,
MIRAGE is robust to diverse textures and geometries, as demonstrated with our
experiments on synthetic images generated with pretrained Stable Diffusion.
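The cross-frame attention used at inference time can be illustrated with a minimal NumPy sketch: queries from each view attend over keys and values pooled from all views, which couples the generated frames. This is a hypothetical illustration of the general mechanism, not MIRAGE's actual implementation; all shapes and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    """Sketch of cross-frame attention (shapes assumed):
    q, k, v: (n_frames, n_tokens, dim).
    Each frame's queries attend over keys/values concatenated across
    ALL frames, tying the views together for cross-view consistency."""
    n_frames, n_tokens, dim = q.shape
    k_all = k.reshape(n_frames * n_tokens, dim)  # pool keys across frames
    v_all = v.reshape(n_frames * n_tokens, dim)  # pool values across frames
    scores = q @ k_all.T / np.sqrt(dim)          # (n_frames, n_tokens, n_frames*n_tokens)
    return softmax(scores) @ v_all               # (n_frames, n_tokens, dim)
```

In standard per-frame attention each frame would only see its own `n_tokens` keys; pooling them across frames is what enforces consistency between views. The paper's hard-attention guidance, which sharpens the attention distribution further, is omitted from this sketch.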
Related papers
- GAS: Generative Avatar Synthesis from a Single Image [54.95198111659466]
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image.
Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model.
arXiv Detail & Related papers (2025-02-10T19:00:39Z)
- MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
MOVIS aims to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS.
We introduce an auxiliary task requiring the model to simultaneously predict novel view object masks.
To evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement.
arXiv Detail & Related papers (2024-12-16T05:23:45Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images [79.39247661907397]
We introduce an effective framework Generalizable Model-based Neural Radiance Fields to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z)
- Object-Centric Slot Diffusion [30.722428924152382]
We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes.
We demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders.
We also conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD.
arXiv Detail & Related papers (2023-03-20T02:40:16Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models [33.69732363040526]
We propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images.
This is the first work to successfully leverage diffusion models for coherent visual story synthesis.
arXiv Detail & Related papers (2022-11-20T11:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.