Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models
- URL: http://arxiv.org/abs/2403.06381v1
- Date: Mon, 11 Mar 2024 02:18:27 GMT
- Authors: Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, Tiviatis Sim, Kenji Kawaguchi
- Abstract summary: Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
- Score: 23.786473791344395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in diffusion models have notably improved the perceptual
quality of generated images in text-to-image synthesis tasks. However,
diffusion models often struggle to produce images that accurately reflect the
intended semantics of the associated text prompts. We examine cross-attention
layers in diffusion models and observe a propensity for these layers to
disproportionately focus on certain tokens during the generation process,
thereby undermining semantic fidelity. To address the issue of dominant
attention, we introduce attention regulation, a computation-efficient
on-the-fly optimization approach at inference time to align attention maps with
the input text prompt. Notably, our method requires no additional training or
fine-tuning and serves as a plug-in module on a model. Hence, the generation
capacity of the original model is fully preserved. We compare our approach with
alternative approaches across various datasets, evaluation metrics, and
diffusion models. Experiment results show that our method consistently
outperforms other baselines, yielding images that more faithfully reflect the
desired concepts with reduced computation overhead. Code is available at
https://github.com/YaNgZhAnG-V5/attention_regulation.
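As a rough illustration of the idea in the abstract, the toy sketch below computes cross-attention maps on made-up features, measures how much attention each prompt token receives on average, and then nudges an additive per-token bias on the attention logits until neglected tokens reach a minimum share. Everything here is an assumption for illustration: the function names, the `min_share` floor, the feedback update, and the data are hypothetical and are not taken from the paper or its repository, whose actual objective and optimizer may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, bias):
    """Return one attention distribution over prompt tokens per image location."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    maps = []
    for q in queries:
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) + b
            for k, b in zip(keys, bias)
        ]
        maps.append(softmax(logits))
    return maps

def token_share(maps):
    """Average attention each token receives across all image locations."""
    return [sum(col) / len(maps) for col in zip(*maps)]

def regulate(queries, keys, targets, min_share=0.3, lr=1.0, steps=200):
    """Hypothetical feedback loop, not the paper's algorithm.

    Raises the logit bias of each target token in proportion to how far
    its average attention share falls below min_share. No model weights
    are touched, which mirrors the training-free setting in the abstract.
    """
    bias = [0.0] * len(keys)
    for _ in range(steps):
        share = token_share(cross_attention(queries, keys, bias))
        for t in targets:
            deficit = min_share - share[t]
            if deficit > 0:
                bias[t] += lr * deficit
    return bias

# Made-up features: 4 image locations, 3 prompt tokens, token 0 dominates.
queries = [[1.0, 0.2], [0.9, 0.1], [1.0, 0.0], [0.8, 0.3]]
keys = [[3.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

before = token_share(cross_attention(queries, keys, [0.0, 0.0, 0.0]))
bias = regulate(queries, keys, targets=[1, 2])
after = token_share(cross_attention(queries, keys, bias))
print("share before:", [round(s, 3) for s in before])
print("share after: ", [round(s, 3) for s in after])
```

In this sketch only the attention logits are adjusted at inference time, so the keys and queries (standing in for the pretrained model) stay unchanged. A real diffusion pipeline would apply such an adjustment inside its cross-attention layers at selected denoising steps.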
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR demonstrates far superior performance to other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- Enhancing Consistency-Based Image Generation via Adversarialy-Trained Classification and Energy-Based Discrimination [13.238373528922194]
We propose a novel technique for post-processing Consistency-based generated images, enhancing their perceptual quality.
Our approach utilizes a joint classifier-discriminator model, in which both portions are trained adversarially.
By employing example-specific projected gradient under the guidance of this joint machine, we refine synthesized images and achieve improved FID scores on the ImageNet 64x64 dataset.
arXiv Detail & Related papers (2024-05-25T14:53:52Z)
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Semantic Guidance Tuning for Text-To-Image Diffusion Models [3.3881449308956726]
We propose a training-free approach that modulates the guidance direction of diffusion models during inference.
We first decompose the prompt semantics into a set of concepts, and monitor the guidance trajectory in relation to each concept.
Based on this observation, we devise a technique to steer the guidance direction towards any concept from which the model diverges.
arXiv Detail & Related papers (2023-12-26T09:02:17Z)
- JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models.
Previous methods may neglect the importance of adequately formulating a task-specific conditioning strategy.
We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
arXiv Detail & Related papers (2023-10-05T17:59:18Z)
- Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)