Related papers: Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

URL: http://arxiv.org/abs/2408.15660v1
Date: Wed, 28 Aug 2024 09:22:32 GMT
Title: Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
Abstract summary: We propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence.
Score: 33.334956022229846
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which recent works have tackled by combining independent diffusion paths over overlapping latent features, an approach referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
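
As a rough illustration of the joint diffusion setting the abstract refers to, the sketch below splits a one-dimensional "latent" strip into overlapping windows and merges per-window results by averaging where they overlap. This is an assumption-laden toy, not the Merge-Attend-Diffuse operator, whose details are not given on this page: the function names `split_views` and `merge_views`, the window and stride values, and the use of plain averaging are all hypothetical choices made for the example, and no diffusion model is involved.

```python
def split_views(width, window, stride):
    """Return (start, end) column ranges of overlapping windows covering [0, width).

    Assumes window <= width (hypothetical helper, not from the paper).
    """
    starts = list(range(0, max(width - window, 0) + 1, stride))
    # Add a final window if the regular stride does not reach the right edge.
    if starts[-1] + window < width:
        starts.append(width - window)
    return [(s, s + window) for s in starts]


def merge_views(views, ranges, width):
    """Average per-window values back into one strip of length `width`."""
    total = [0.0] * width
    count = [0] * width
    for values, (start, end) in zip(views, ranges):
        for offset, col in enumerate(range(start, end)):
            total[col] += values[offset]
            count[col] += 1
    # Every column is covered by at least one window, so count is never zero.
    return [t / c for t, c in zip(total, count)]


# Toy example: 8 columns, windows of 4 columns, stride 2.
ranges = split_views(width=8, window=4, stride=2)
# Stand-in for independent per-window denoising: window i outputs the constant i.
views = [[float(i)] * 4 for i in range(len(ranges))]
merged = merge_views(views, ranges, width=8)
print(ranges)
print(merged)
```

In this toy, columns covered by two windows end up halfway between the two window outputs, which conveys how averaging overlaps can align neighboring views at the pixel level without any mechanism for the windows to agree on content.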

Related papers

Reversible Efficient Diffusion for Image Fusion [66.35113261837469]
Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. We propose the Reversible Efficient Diffusion (RED) model - an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding the distribution estimation.
arXiv Detail & Related papers (2026-01-28T05:14:55Z)
Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing [25.138589492384654]
We propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. We first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of images. We integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing.
arXiv Detail & Related papers (2025-09-24T13:11:37Z)
From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion [98.31811240195324]
ConFill is a novel framework that reduces discrepancies between generated and original images at each diffusion step. It outperforms current methods, setting a new benchmark in image completion.
arXiv Detail & Related papers (2025-04-19T13:40:46Z)
Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion [4.0301593672451]
Diffusion Prism is a training-free framework that transforms binary masks into realistic and diverse samples. We find that a small amount of artificial noise significantly assists the image-denoising process.
arXiv Detail & Related papers (2025-01-01T20:04:25Z)
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution [48.88184541515326]
We propose a simple and effective method, named FaithDiff, to fully harness the power of latent diffusion models (LDMs) for faithful image SR. In contrast to existing diffusion-based SR methods that freeze the diffusion model pre-trained on high-quality images, we propose to unleash the diffusion prior to identify useful information and recover faithful structures.
arXiv Detail & Related papers (2024-11-27T23:58:03Z)
Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters. We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z)
FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z)
MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. CLS embeddings are used on the one hand to augment the text embeddings, and on the other hand together with patch embeddings to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
ViewFusion: Towards Multi-View Consistency via Interpolated Denoising [48.02829400913904]
We introduce ViewFusion, a training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation. Our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning.
arXiv Detail & Related papers (2024-02-29T04:21:38Z)
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS). A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt. Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z)
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing [28.593023489682654]
We present DiffMorpher, the first approach enabling smooth and natural image morphing using diffusion models. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively, and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition. In addition, we propose an attention and injection technique and a new sampling schedule to further enhance the smoothness between consecutive images.
arXiv Detail & Related papers (2023-12-12T16:28:08Z)
Real-World Image Variation by Aligning Diffusion Inversion Chain [53.772004619296794]
A domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real-world images. We propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL). Our pipeline enhances the generation quality of image variations by aligning the image generation process to the source image's inversion chain.
arXiv Detail & Related papers (2023-05-30T04:09:47Z)
SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, leveraging denoising diffusion models to capture internal distribution of patches from a single natural image. It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale instead of multiple models with progressive growing of scales. Second, we identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
On Conditioning the Input Noise for Controlled Image Generation with Diffusion Models [27.472482893004862]
Conditional image generation has paved the way for several breakthroughs in image editing, generating stock photos and 3-D object generation. In this work, we explore techniques to condition diffusion models with carefully crafted input noise artifacts.
arXiv Detail & Related papers (2022-05-08T13:18:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.