Multi-View Unsupervised Image Generation with Cross Attention Guidance
- URL: http://arxiv.org/abs/2312.04337v1
- Date: Thu, 7 Dec 2023 14:55:13 GMT
- Title: Multi-View Unsupervised Image Generation with Cross Attention Guidance
- Authors: Llukman Cerkezi, Aram Davtyan, Sepehr Sameni, Paolo Favaro
- Abstract summary: This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets.
We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts.
Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growing interest in novel view synthesis, driven by Neural Radiance Field
(NeRF) models, is hindered by scalability issues due to their reliance on
precisely annotated multi-view images. Recent models address this by
fine-tuning large text2image diffusion models on synthetic multi-view data.
Despite robust zero-shot generalization, they may need post-processing and can
face quality issues due to the synthetic-real domain gap. This paper introduces
a novel pipeline for unsupervised training of a pose-conditioned diffusion
model on single-category datasets. With the help of pretrained self-supervised
Vision Transformers (DINOv2), we identify object poses by clustering the
dataset through comparing visibility and locations of specific object parts.
The pose-conditioned diffusion model, trained on these pose labels and equipped
with cross-frame attention at inference time, ensures cross-view consistency,
which is further aided by our novel hard-attention guidance. Our model, MIRAGE,
surpasses prior work in novel view synthesis on real images. Furthermore,
MIRAGE is robust to diverse textures and geometries, as demonstrated with our
experiments on synthetic images generated with pretrained Stable Diffusion.
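The cross-frame attention used at inference time can be illustrated with a minimal NumPy sketch: queries from each view attend over keys and values pooled from all views, which couples the generated frames. This is a hypothetical illustration of the general mechanism, not MIRAGE's actual implementation; all shapes and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q, k, v):
    """Sketch of cross-frame attention (shapes assumed):
    q, k, v: (n_frames, n_tokens, dim).
    Each frame's queries attend over keys/values concatenated across
    ALL frames, tying the views together for cross-view consistency."""
    n_frames, n_tokens, dim = q.shape
    k_all = k.reshape(n_frames * n_tokens, dim)  # pool keys across frames
    v_all = v.reshape(n_frames * n_tokens, dim)  # pool values across frames
    scores = q @ k_all.T / np.sqrt(dim)          # (n_frames, n_tokens, n_frames*n_tokens)
    return softmax(scores) @ v_all               # (n_frames, n_tokens, dim)
```

In standard per-frame attention each frame would only see its own `n_tokens` keys; pooling them across frames is what enforces consistency between views. The paper's hard-attention guidance, which sharpens the attention distribution further, is omitted from this sketch.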
Related papers
- GAS: Generative Avatar Synthesis from a Single Image [54.95198111659466]
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image.
Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model.
arXiv Detail & Related papers (2025-02-10T19:00:39Z)
- MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
MOVIS aims to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS.
We introduce an auxiliary task requiring the model to simultaneously predict novel view object masks.
To evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement.
arXiv Detail & Related papers (2024-12-16T05:23:45Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images [79.39247661907397]
We introduce an effective framework Generalizable Model-based Neural Radiance Fields to synthesize free-viewpoint images.
Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy.
arXiv Detail & Related papers (2023-03-24T03:32:02Z)
- Object-Centric Slot Diffusion [30.722428924152382]
We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes.
We demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders.
We also conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD.
arXiv Detail & Related papers (2023-03-20T02:40:16Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models [33.69732363040526]
We propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images.
This is the first work to successfully leverage diffusion models for coherent visual story synthesis.
arXiv Detail & Related papers (2022-11-20T11:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.