MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
- URL: http://arxiv.org/abs/2309.04399v1
- Date: Fri, 8 Sep 2023 15:53:37 GMT
- Title: MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
- Authors: Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou,
Jiashi Feng
- Abstract summary: A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning between the prompt and the output image.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
- Score: 84.84034179136458
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in diffusion models have showcased their impressive
capacity to generate visually striking images. Nevertheless, ensuring a close
match between the generated image and the given prompt remains a persistent
challenge. In this work, we identify that a crucial factor leading to the
text-image mismatch issue is the inadequate cross-modality relation learning
between the prompt and the output image. To better align the prompt and image
content, we advance the cross-attention with an adaptive mask, which is
conditioned on the attention maps and the prompt embeddings, to dynamically
adjust the contribution of each text token to the image features. This
mechanism explicitly diminishes the ambiguity in semantic information embedding
from the text encoder, leading to a boost of text-to-image consistency in the
synthesized images. Our method, termed MaskDiffusion, is training-free and
hot-pluggable for popular pre-trained diffusion models. When applied to the
latent diffusion models, our MaskDiffusion can significantly improve the
text-to-image consistency with negligible computation overhead compared to the
original diffusion models.
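
The abstract does not spell out the masking rule, so the following PyTorch sketch only illustrates the general shape of the idea: derive a soft per-token mask from the current attention maps and the prompt embeddings, then re-weight cross-attention before aggregating the values. The mask heuristic, tensor names, and the tau temperature are illustrative assumptions, not the paper's formulation.

```python
import torch

def masked_cross_attention(q, k, v, prompt_emb, tau=0.1):
    """Sketch of cross-attention with an adaptive token mask.

    q: image queries           (batch, n_pixels, d)
    k, v: text keys / values   (batch, n_tokens, d)
    prompt_emb: text embeddings conditioning the mask (batch, n_tokens, d)

    The mask heuristic below (per-token attention mass weighted by
    prompt-embedding norms, passed through a soft threshold) is an
    illustrative assumption, not the paper's exact formulation.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5          # (b, n_pixels, n_tokens)
    attn = scores.softmax(dim=-1)

    # Per-token importance: attention mass each token receives,
    # modulated by the magnitude of its prompt embedding.
    token_mass = attn.mean(dim=1)                      # (b, n_tokens)
    token_scale = prompt_emb.norm(dim=-1)              # (b, n_tokens)
    importance = token_mass * token_scale

    # Soft mask: suppress tokens whose importance is below average.
    mask = torch.sigmoid((importance - importance.mean(dim=-1, keepdim=True)) / tau)

    # Re-weight attention by the mask, renormalize, aggregate values.
    attn = attn * mask.unsqueeze(1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn @ v
```

Since the mask is computed from quantities the cross-attention layer already produces, the extra cost is a few elementwise operations per layer, which is consistent with the negligible overhead the abstract claims.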
Related papers
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
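
As a rough illustration of such inference-time regulation, the sketch below takes a few gradient steps on the cross-attention logits so that no prompt token is starved of attention mass. The loss and update rule are assumptions for illustration, not the paper's objective.

```python
import torch

def regulate_attention(attn_logits, target_token_ids, steps=5, lr=0.1):
    """Sketch of inference-time attention regulation.

    attn_logits: cross-attention logits, shape (n_pixels, n_tokens).
    target_token_ids: prompt tokens whose attention should not be neglected.

    The loss below (push each target token toward at least a uniform
    share of attention mass) is an illustrative assumption; the paper's
    objective and update rule may differ.
    """
    logits = attn_logits.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        attn = logits.softmax(dim=-1)        # (n_pixels, n_tokens)
        mass = attn.mean(dim=0)              # per-token attention mass
        uniform = 1.0 / attn.shape[-1]
        # Penalize target tokens whose mass falls below the uniform share.
        loss = torch.relu(uniform - mass[target_token_ids]).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logits.detach().softmax(dim=-1)
```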
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models [46.18013380882767]
This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
arXiv Detail & Related papers (2023-12-19T18:47:30Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
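
A minimal sketch of the resulting decision rule, assuming a hypothetical attention_score helper that pools cross-attention scores from a frozen Stable Diffusion pass; the paper's exact pooling and prompt-tuning step are omitted.

```python
import torch

def few_shot_classify(attention_score, image, class_prompts):
    """Sketch of a discriminative use of a text-to-image diffusion model.

    attention_score(image, prompt) -> float is a hypothetical helper
    standing in for a pass through the model's cross-attention layers.
    The class whose prompt interacts most strongly with the image wins.
    """
    scores = torch.tensor([attention_score(image, p) for p in class_prompts])
    return class_prompts[int(scores.argmax())]
```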
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models [103.61066310897928]
Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt.
While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt.
We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt.
We introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images.
arXiv Detail & Related papers (2023-01-31T18:10:38Z)
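
The sketch below shows one plausible nursing step in PyTorch: nudge the latent so the most neglected subject token gains attention. The attn_for_subjects helper is hypothetical, and the loss omits the paper's Gaussian smoothing and iterative refinement schedule.

```python
import torch

def gsn_step(latent, attn_for_subjects, step_size=0.05):
    """Sketch of one Generative Semantic Nursing update.

    latent: current diffusion latent.
    attn_for_subjects(latent) -> tensor of per-subject-token maximum
    attention values; a hypothetical, differentiable helper standing in
    for a forward pass that collects cross-attention maps.
    """
    latent = latent.detach().clone().requires_grad_(True)
    subject_attn = attn_for_subjects(latent)   # (n_subject_tokens,)
    loss = 1.0 - subject_attn.min()            # neglect of the weakest subject
    loss.backward()
    # Move the latent in the direction that excites the weakest subject.
    with torch.no_grad():
        latent = latent - step_size * latent.grad
    return latent.detach()
```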
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited, by contrasting predictions of the diffusion model conditioned on different text prompts.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
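
A minimal sketch of that mask estimation, assuming a hypothetical noise_pred wrapper around a text-conditioned denoiser; DiffEdit additionally averages the contrast over several noised versions of the input, which is omitted here.

```python
import torch

def diffedit_mask(noise_pred, x_t, t, src_prompt, tgt_prompt, thresh=0.5):
    """Sketch of DiffEdit-style automatic mask estimation.

    noise_pred(x_t, t, prompt) -> predicted noise, shape (b, c, h, w),
    is a hypothetical wrapper around a diffusion model's denoiser.
    Regions where the source- and target-conditioned noise estimates
    disagree most are the ones the edit should change.
    """
    eps_src = noise_pred(x_t, t, src_prompt)
    eps_tgt = noise_pred(x_t, t, tgt_prompt)
    diff = (eps_src - eps_tgt).abs().mean(dim=1, keepdim=True)  # (b, 1, h, w)
    # Normalize per image, then threshold into a binary edit mask.
    diff = diff / diff.amax(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
    return (diff > thresh).float()
```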