Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
- URL: http://arxiv.org/abs/2401.08815v1
- Date: Tue, 16 Jan 2024 20:31:46 GMT
- Title: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
- Authors: Yumeng Li and Margret Keuper and Dan Zhang and Anna Khoreva
- Abstract summary: Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout.
We propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM).
Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout.
- Score: 21.49096276631859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent advances in large-scale diffusion models, little progress
has been made on the layout-to-image (L2I) synthesis task. Current L2I models
either suffer from poor editability via text or weak alignment between the
generated image and the input layout. This limits their usability in practice.
To mitigate this, we propose to integrate adversarial supervision into the
conventional training pipeline of L2I diffusion models (ALDM). Specifically, we
employ a segmentation-based discriminator which provides explicit feedback to
the diffusion generator on the pixel-level alignment between the denoised image
and the input layout. To encourage consistent adherence to the input layout
over the sampling steps, we further introduce the multistep unrolling strategy.
Instead of looking at a single timestep, we unroll a few steps recursively to
imitate the inference process, and ask the discriminator to assess the
alignment of denoised images with the layout over a certain time window. Our
experiments show that ALDM enables layout faithfulness of the generated images,
while allowing broad editability via text prompts. Moreover, we showcase its
usefulness for practical applications: by synthesizing target distribution
samples via text control, we improve domain generalization of semantic
segmentation models by a large margin (~12 mIoU points).
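As a rough illustration of the abstract's two ideas, the PyTorch-style sketch below couples the usual denoising loss with per-pixel adversarial feedback from a segmentation-based discriminator, and unrolls a few denoising steps so the discriminator judges alignment over a time window. All names (generator, discriminator, predict_x0) and the extra "fake"-class discriminator loss are illustrative assumptions, not the authors' code; the real ALDM builds on a pretrained latent diffusion backbone, which this toy omits.

```python
import torch
import torch.nn.functional as F

def predict_x0(x_t, eps_pred, a_bar):
    # Standard DDPM estimate of the clean image from the noisy input
    # and the predicted noise.
    return (x_t - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()

def aldm_losses(generator, discriminator, x0, layout, t, alpha_bar, unroll=3):
    """One training step with adversarial supervision and multistep
    unrolling (a sketch, not the paper's exact procedure). `layout`
    holds per-pixel class indices, shape (B, H, W); the discriminator
    outputs (B, C+1, H, W) logits, where the extra class marks
    generated ("fake") pixels, as in segmentation-based GANs."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    adv_loss, diff_loss = 0.0, None
    for k in range(unroll):                      # imitate a few inference steps
        eps = generator(x_t, t, layout)
        if k == 0:
            diff_loss = F.mse_loss(eps, noise)   # valid only at the original t
        x0_hat = predict_x0(x_t, eps, a)
        # Pixel-level feedback: the denoised image should segment
        # into exactly the input layout.
        adv_loss = adv_loss + F.cross_entropy(discriminator(x0_hat), layout)
        t = torch.clamp(t - 1, min=0)            # move one step toward t = 0
        a = alpha_bar[t].view(-1, 1, 1, 1)
        # Re-noising x0_hat is a simplification; the paper unrolls
        # actual sampler steps instead.
        x_t = a.sqrt() * x0_hat + (1.0 - a).sqrt() * torch.randn_like(x0)

    # Discriminator update: real pixels keep their true classes,
    # denoised pixels are pushed into the extra "fake" class.
    real_logits = discriminator(x0)
    fake = torch.full_like(layout, real_logits.shape[1] - 1)
    d_loss = F.cross_entropy(real_logits, layout) \
           + F.cross_entropy(discriminator(x0_hat.detach()), fake)
    return diff_loss + adv_loss / unroll, d_loss
```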
Related papers
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation [47.435234391588494]
Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting.
We propose a novel zero-shot L2I approach, BACON, which eliminates the need for additional modules or fine-tuning.
We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
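One common way to exploit pixel-to-pixel self-attention correlations is to propagate them over the cross-attention maps, as in this hypothetical sketch; it is a guess at the general technique, not BACON's actual implementation.

```python
import torch

def refine_cross_attention(self_attn, cross_attn, steps=2):
    """self_attn: (N, N) pixel-to-pixel correlations from a
    self-attention layer; cross_attn: (N, T) attention from pixels to
    text tokens. Propagating the former over the latter spreads each
    token's response across pixels the model already groups together,
    tightening object boundaries."""
    attn = cross_attn
    for _ in range(steps):
        attn = self_attn @ attn                 # (N, N) @ (N, T) -> (N, T)
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```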
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
- Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for the semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including adaptation from images to the depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
- Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion [35.21106030549071]
Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs).
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
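In the model-inversion view, conditional generation becomes direct optimization of the image under a discriminative VLM's score. A toy sketch, with a generic scoring callable standing in for the VLMs used in the paper:

```python
import torch

def invert_with_vlm(score_fn, steps=200, lr=0.05, shape=(1, 3, 224, 224)):
    """Toy model inversion: optimize pixels directly so that a frozen
    discriminative VLM judges them to match the prompt. `score_fn` is
    any callable mapping an image batch to a scalar text-image
    alignment score (e.g., a CLIP-style similarity); it is a
    hypothetical stand-in, not the paper's actual objective."""
    img = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        loss = -score_fn(img.sigmoid())         # maximize alignment
        opt.zero_grad()
        loss.backward()
        opt.step()
    return img.sigmoid().detach()
```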
arXiv Detail & Related papers (2024-02-26T05:08:40Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- Grounded Text-to-Image Synthesis with Attention Refocusing [16.9170825951175]
We reveal the potential causes of imprecise grounding in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
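A rough sketch of attention refocusing during sampling: encourage each grounded token's cross-attention to peak inside its assigned layout region and penalize attention leaking outside it. The two terms below are illustrative stand-ins for the paper's losses.

```python
import torch

def refocus_losses(cross_attn, masks):
    """cross_attn: (T, H, W) attention map per grounded text token;
    masks: (T, H, W) binary layout masks. The foreground term rewards
    a strong attention peak inside each token's region; the background
    term penalizes attention mass that falls outside it."""
    attn = cross_attn / cross_attn.sum(dim=(1, 2), keepdim=True).clamp_min(1e-8)
    peak_inside = (attn * masks).amax(dim=(1, 2))
    fg_loss = ((1.0 - peak_inside) ** 2).mean()
    bg_loss = ((attn * (1.0 - masks)).sum(dim=(1, 2)) ** 2).mean()
    return fg_loss + bg_loss
```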
arXiv Detail & Related papers (2023-06-08T17:59:59Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
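The coarse stage's caption reward can be pictured as captioning the generated image and scoring the caption against the original prompt; the captioner and text encoder below are hypothetical placeholders, not the paper's exact components.

```python
import torch.nn.functional as F

def caption_reward(image, prompt, captioner, text_encoder):
    """Hypothetical caption reward: describe the generated image with
    an off-the-shelf captioner, then measure how closely the caption
    tracks the prompt in a shared text-embedding space."""
    caption = captioner(image)                        # image -> text
    e_cap = F.normalize(text_encoder(caption), dim=-1)
    e_prompt = F.normalize(text_encoder(prompt), dim=-1)
    return (e_cap * e_prompt).sum(dim=-1)             # cosine similarity
```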
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models [50.73105631853759]
We present a novel generative model named LayoutDiffusion for automatic layout generation.
It learns to reverse a mild forward process in which layouts become increasingly chaotic as the number of forward steps grows.
It enables two conditional layout generation tasks in a plug-and-play manner without re-training and achieves better performance than existing methods.
arXiv Detail & Related papers (2023-03-21T04:41:02Z)
- Blended Latent Diffusion [18.043090347648157]
We present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask.
Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space.
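The core blending step can be sketched as follows: at every denoising step, advance the latent one reverse step, then overwrite everything outside the user's mask with a correspondingly noised version of the source image's latent, so only the masked region changes. Names and signatures are illustrative assumptions.

```python
import torch

def blended_denoise_step(z, z_src, mask, t, alpha_bar, denoise):
    """One mask-confined editing step in latent space (rough sketch).
    `z_src` is the latent of the original image, `mask` selects the
    user's edit region (1 = edit), and `denoise` performs one
    reverse-diffusion step on the latent."""
    z = denoise(z, t)                                 # progress the edit region
    a = alpha_bar[t]
    z_bg = a.sqrt() * z_src + (1.0 - a).sqrt() * torch.randn_like(z_src)
    return mask * z + (1.0 - mask) * z_bg             # blend in latent space
```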
arXiv Detail & Related papers (2022-06-06T17:58:04Z)