Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models
- URL: http://arxiv.org/abs/2301.13826v2
- Date: Wed, 31 May 2023 15:42:00 GMT
- Title: Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image
Diffusion Models
- Authors: Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, Daniel Cohen-Or
- Abstract summary: Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
- Score: 103.61066310897928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image generative models have demonstrated an unparalleled
ability to generate diverse and creative imagery guided by a target text
prompt. While revolutionary, current state-of-the-art diffusion models may
still fail in generating images that fully convey the semantics in the given
text prompt. We analyze the publicly available Stable Diffusion model and
assess the existence of catastrophic neglect, where the model fails to generate
one or more of the subjects from the input prompt. Moreover, we find that in
some cases the model also fails to correctly bind attributes (e.g., colors) to
their corresponding subjects. To help mitigate these failure cases, we
introduce the concept of Generative Semantic Nursing (GSN), where we seek to
intervene in the generative process on the fly during inference time to improve
the faithfulness of the generated images. Using an attention-based formulation
of GSN, dubbed Attend-and-Excite, we guide the model to refine the
cross-attention units to attend to all subject tokens in the text prompt and
strengthen - or excite - their activations, encouraging the model to generate
all subjects described in the text prompt. We compare our approach to
alternative approaches and demonstrate that it conveys the desired concepts
more faithfully across a range of text prompts.
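To make the mechanism concrete, the following is a minimal PyTorch sketch of one such inference-time update on the noised latent. The attention accessor, the step size, and the omission of the paper's attention smoothing and iterative refinement schedule are simplifying assumptions for illustration, not the authors' exact implementation.

```python
import torch

def attend_and_excite_step(latent, get_cross_attention, subject_token_ids,
                           step_size=20.0):
    """One Generative Semantic Nursing update on the noised latent z_t.

    latent              : current latent, e.g. shape (1, 4, 64, 64)
    get_cross_attention : callable mapping the latent to averaged cross-attention
                          maps of shape (num_pixels, num_text_tokens); in practice
                          this wraps a UNet forward pass with attention hooks
                          (hypothetical helper, not a library API)
    subject_token_ids   : prompt indices of the subject tokens to "excite"
    """
    latent = latent.detach().requires_grad_(True)
    attn = get_cross_attention(latent)                  # (pixels, tokens)

    # Strongest activation each subject token receives anywhere in the image
    # (the paper additionally smooths the maps before taking this maximum).
    per_token_max = torch.stack([attn[:, t].max() for t in subject_token_ids])

    # The loss focuses on the most neglected subject token.
    loss = (1.0 - per_token_max).max()

    # A gradient step on the latent strengthens that token's attention.
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()
```

Applied during the early denoising steps, where the subjects' layout is still being decided, such an update nudges the sampler toward generating every subject before the image structure is fixed.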
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on additional reference images that provide visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins outperform existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models [20.19571676239579]
We introduce a novel diffusion-based framework to enhance the alignment of generated images with their corresponding descriptions.
Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image.
We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt.
arXiv Detail & Related papers (2024-06-24T06:12:16Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method effectively learns prompts that improve the match between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features (an illustrative sketch of this kind of token re-weighting follows the list below).
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models [12.310393737912412]
We pioneer a systematic study of the authenticity of fake images generated by text-to-image diffusion models.
For the visual modality, we propose a universal detector, demonstrating that fake images from these text-to-image diffusion models share common cues.
For the linguistic modality, we analyze the impact of text captions on the authenticity of images generated by text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-13T13:08:54Z)
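As a companion to the MaskDiffusion entry above, the following is a small, purely illustrative sketch of adjusting per-token contributions inside a cross-attention layer using a mask derived from the current attention maps. The thresholding rule, the boost-only behaviour, and all names are assumptions made for this example; MaskDiffusion's actual mask also conditions on the prompt embeddings and may differ in form.

```python
import torch

def adaptive_masked_cross_attention(q, k, v, floor=0.2):
    """Cross-attention whose per-token weights are rescaled by an adaptive mask.

    q : image queries, shape (num_pixels, d)
    k : text keys,     shape (num_tokens, d)
    v : text values,   shape (num_tokens, d)

    The rule below boosts tokens whose strongest activation falls under
    `floor`; this is one possible choice, used only for illustration.
    """
    d = q.shape[-1]
    attn = (q @ k.t() / d ** 0.5).softmax(dim=-1)   # (pixels, tokens)

    # Per-token evidence: the best activation each text token gets in the image.
    token_peak = attn.max(dim=0).values             # (tokens,)

    # Adaptive mask conditioned on the attention maps: under-attended tokens
    # get a weight above 1, well-attended tokens keep a weight of 1.
    mask = torch.clamp(floor / (token_peak + 1e-6), min=1.0)

    # Re-normalize so each pixel's weights still sum to one, then aggregate.
    adjusted = attn * mask
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)
    return adjusted @ v                             # (pixels, d)
```

A mask of this kind can be recomputed at every denoising step, so the adjustment tracks how the attention evolves as the image forms.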