Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models
- URL: http://arxiv.org/abs/2403.06381v1
- Date: Mon, 11 Mar 2024 02:18:27 GMT
- Title: Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models
- Authors: Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi
- Abstract summary: Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
- Score: 23.786473791344395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in diffusion models have notably improved the perceptual
quality of generated images in text-to-image synthesis tasks. However,
diffusion models often struggle to produce images that accurately reflect the
intended semantics of the associated text prompts. We examine cross-attention
layers in diffusion models and observe a propensity for these layers to
disproportionately focus on certain tokens during the generation process,
thereby undermining semantic fidelity. To address the issue of dominant
attention, we introduce attention regulation, a computationally efficient,
on-the-fly optimization approach applied at inference time to align attention maps with
the input text prompt. Notably, our method requires no additional training or
fine-tuning and serves as a plug-in module for the model. Hence, the generation
capacity of the original model is fully preserved. We compare our approach with
alternative approaches across various datasets, evaluation metrics, and
diffusion models. Experimental results show that our method consistently
outperforms the baselines, yielding images that more faithfully reflect the
desired concepts with reduced computational overhead. Code is available at
https://github.com/YaNgZhAnG-V5/attention_regulation.
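As a rough, hypothetical sketch of the idea (not the paper's actual objective or code), the snippet below treats dominant attention as excess per-token attention mass and reduces it with a few gradient steps on the attention logits at inference time; the quadratic penalty, uniform per-token budget, and step size are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def regulate_attention(attn, target_mass=None, lr=0.5, steps=50):
    """Nudge a cross-attention map so that no text token dominates.

    attn: (n_pixels, n_tokens) map whose rows sum to 1. The objective
    0.5 * sum_j max(mass_j - target, 0)^2 penalizes tokens whose total
    attention mass exceeds a uniform budget; it is a simplified
    stand-in for the paper's inference-time objective.
    """
    n_pixels, n_tokens = attn.shape
    if target_mass is None:
        target_mass = n_pixels / n_tokens   # uniform per-token budget
    logits = np.log(attn + 1e-8)            # optimize in logit space
    for _ in range(steps):
        a = softmax(logits)
        excess = np.maximum(a.sum(axis=0) - target_mass, 0.0)
        # backprop the penalty through the row-wise softmax
        g_a = np.broadcast_to(excess, a.shape)
        g_logits = a * (g_a - (g_a * a).sum(axis=1, keepdims=True))
        logits = logits - lr * g_logits
    return softmax(logits)
```

In a real pipeline this adjustment would run on the cross-attention maps inside the denoising loop; standalone, it only demonstrates the redistribution effect: rows still sum to 1, but the dominant token's total mass shrinks toward the budget.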
Related papers
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
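A minimal sketch of the flavor of such an objective (the paper's exact loss differs): penalize similarity between synthetic-image features drawn from different classes, so that minimizing the loss pushes the class distributions apart. The function name, temperature, and pairing rule below are illustrative assumptions.

```python
import numpy as np

def class_overlap_loss(feats, labels, temp=0.5):
    """Toy contrastive-style penalty on cross-class feature overlap.

    Features are L2-normalized; every pair of samples from *different*
    classes contributes exp(similarity / temp), so minimizing the loss
    separates the classes. A simplified stand-in for the paper's
    overlap optimization.
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = (f @ f.T) / temp
    labels = np.asarray(labels)
    cross = labels[:, None] != labels[None, :]   # cross-class pairs only
    return float(np.exp(sim[cross]).mean())
```

Well-separated classes (e.g. orthogonal feature clusters) score lower than fully overlapping ones, which is the signal a training loop would minimize.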
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- Semantic Guidance Tuning for Text-To-Image Diffusion Models [3.3881449308956726]
We propose a training-free approach that modulates the guidance direction of diffusion models during inference.
We first decompose the prompt semantics into a set of concepts, and monitor the guidance trajectory in relation to each concept.
Based on this observation, we devise a technique to steer the guidance direction towards any concept from which the model diverges.
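One hypothetical way to realize this kind of concept-wise steering (not the paper's exact formulation) is to compare each concept's guidance direction against the aggregate direction and up-weight the concepts the trajectory has diverged from; all names and the weighting rule here are assumptions for illustration.

```python
import numpy as np

def steer_guidance(eps_uncond, eps_concepts, base_scale=7.5):
    """Hypothetical sketch of concept-wise guidance steering.

    eps_concepts maps a concept name to its conditional noise
    prediction. Each concept's guidance direction is compared (cosine
    similarity) against the mean direction, and concepts with low
    alignment, i.e. those the trajectory diverges from, get a larger
    weight in the combined guidance term.
    """
    dirs = {k: e - eps_uncond for k, e in eps_concepts.items()}
    mean_dir = np.mean(list(dirs.values()), axis=0)

    def cos(a, b):
        return float(a.ravel() @ b.ravel()
                     / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # weight = 1 + (1 - cosine): diverged concepts are boosted
    weights = {k: 1.0 + (1.0 - cos(mean_dir, d)) for k, d in dirs.items()}
    total = sum(weights.values())
    steered = sum(w * dirs[k] for k, w in weights.items()) / total
    return eps_uncond + base_scale * steered
```

With two equally aligned concepts the weights coincide and the result reduces to ordinary averaged guidance; the re-weighting only changes the direction when one concept lags behind.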
arXiv Detail & Related papers (2023-12-26T09:02:17Z)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
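The core mechanism, backpropagating a differentiable reward through the unrolled sampling chain, can be sketched with a toy one-parameter "denoiser" whose end-to-end reward gradient has a closed form. Everything below (the update rule, reward, and constants) is illustrative, not AlignProp's implementation, which differentiates through a full U-Net with an autodiff framework.

```python
import numpy as np

# theta is a single scalar "denoiser" parameter, so the reward gradient
# through all T unrolled sampling steps can be written analytically.
rng = np.random.default_rng(0)
theta, lr, step_size, T, target = 0.1, 0.05, 0.3, 4, 1.0

def rollout(theta, x0):
    x = x0
    for _ in range(T):
        x = x - step_size * theta * (x - target)  # toy denoising step
    return x

for _ in range(300):
    x0 = rng.normal(size=32)                 # fresh starting noise
    c = 1.0 - step_size * theta              # per-step contraction factor
    xT = target + (x0 - target) * c ** T     # same result as rollout(theta, x0)
    # reward R = -mean((xT - target)^2); chain rule through all T steps
    dxT_dtheta = (x0 - target) * T * c ** (T - 1) * (-step_size)
    dR_dtheta = np.mean(-2.0 * (xT - target) * dxT_dtheta)
    theta += lr * dR_dtheta                  # gradient ascent on the reward
```

After training, samples produced by the full rollout land near the reward's optimum, which is the point of differentiating the reward end-to-end rather than treating it as a black-box score.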
arXiv Detail & Related papers (2023-10-05T17:59:18Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
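A simplified, hypothetical sketch of such a mask (the paper additionally conditions it on prompt embeddings, which is omitted here): keep each token's contribution only at pixels where its normalized attention is high, so every token acts mainly where the model already attends.

```python
import numpy as np

def adaptive_token_mask(attn, tau=0.5):
    """Illustrative MaskDiffusion-style adaptive mask.

    attn: (n_pixels, n_tokens) cross-attention map. Each token's column
    is max-normalized, then thresholded at tau; the resulting binary
    mask zeroes the token's contribution at weakly attended pixels.
    The threshold rule is an assumption, not the paper's exact design.
    """
    norm = attn / (attn.max(axis=0, keepdims=True) + 1e-8)
    mask = (norm > tau).astype(attn.dtype)
    return attn * mask   # masked contribution of each token
```

Because the mask is computed from the attention maps themselves at each step, no training is involved, matching the training-free, hot-pluggable framing above.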
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [80.82832715884597]
We introduce the new task of predicting the text prompt given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.