Compositional Text-to-Image Synthesis with Attention Map Control of
Diffusion Models
- URL: http://arxiv.org/abs/2305.13921v2
- Date: Wed, 13 Dec 2023 03:46:54 GMT
- Title: Compositional Text-to-Image Synthesis with Attention Map Control of
Diffusion Models
- Authors: Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin
- Abstract summary: Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts.
They fail to semantically align the generated images with the prompts due to their limited compositional capabilities.
We propose a novel attention mask control strategy based on predicted object boxes to address these issues.
- Score: 8.250234707160793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image (T2I) diffusion models show outstanding performance in
generating high-quality images conditioned on textual prompts. However, they
fail to semantically align the generated images with the prompts due to their
limited compositional capabilities, leading to attribute leakage, entity
leakage, and missing entities. In this paper, we propose a novel attention mask
control strategy based on predicted object boxes to address these issues. In
particular, we first train a BoxNet to predict a box for each entity that
possesses the attribute specified in the prompt. Then, depending on the
predicted boxes, a unique mask control is applied to the cross- and
self-attention maps. Our approach produces a more semantically accurate
synthesis by constraining the attention regions of each token in the prompt to
the image. In addition, the proposed method is straightforward and effective
and can be readily integrated into existing cross-attention-based T2I
generators. We compare our approach to competing methods and demonstrate that
it can faithfully convey the semantics of the original text to the generated
content and serve as a ready-to-use plugin. Code is available at
https://github.com/OPPOMente-Lab/attention-mask-control.
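To make the idea above concrete, the following is a minimal PyTorch sketch of constraining cross-attention with one predicted box per entity token. The tensor layout, the box format, and the use of a large negative additive bias are illustrative assumptions, not the paper's exact mask control over cross- and self-attention.

```python
import torch

def boxes_to_token_masks(boxes, h, w):
    """Rasterize one (x0, y0, x1, y1) box per entity token into a binary
    mask on the h x w attention grid (coordinates normalized to [0, 1])."""
    masks = torch.zeros(len(boxes), h, w)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        r0, r1 = int(y0 * h), max(int(y1 * h), int(y0 * h) + 1)
        c0, c1 = int(x0 * w), max(int(x1 * w), int(x0 * w) + 1)
        masks[i, r0:r1, c0:c1] = 1.0
    return masks.flatten(1)                       # (num_entities, h*w)

def masked_cross_attention(scores, token_ids, masks):
    """scores: (heads, h*w, num_tokens) pre-softmax cross-attention logits.
    For each entity token, block attention from pixels outside its box with a
    large negative bias, then renormalize with softmax over the tokens."""
    bias = torch.zeros_like(scores)
    for tok, m in zip(token_ids, masks):
        bias[:, :, tok] = (1.0 - m) * -1e4        # forbid out-of-box pixels
    return (scores + bias).softmax(dim=-1)

# toy usage: 8 heads, a 16x16 latent, 10 prompt tokens, 2 entity tokens
scores = torch.randn(8, 16 * 16, 10)
masks = boxes_to_token_masks([(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)], 16, 16)
attn = masked_cross_attention(scores, token_ids=[2, 7], masks=masks)
```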
Related papers
- Improving Text-guided Object Inpainting with Semantic Pre-inpainting [95.17396565347936]
We decompose the typical single-stage object inpainting into two cascaded processes: semantic pre-inpainting and high-fidelity object generation.
To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion framework.
arXiv Detail & Related papers (2024-09-12T17:55:37Z)
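The cascade described in this entry can be pictured as two modules called in sequence: a semantic inpainter that fills in features for the missing region, followed by a diffusion inpainter conditioned on those features. The module interfaces below are illustrative stand-ins, not the CAscaded Transformer-Diffusion framework itself.

```python
import torch
from torch import nn

class SemanticInpainter(nn.Module):
    """Stage 1 (illustrative stand-in): predict semantic features for the
    masked region from image context and text, before any pixels exist."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim * 2, dim)

    def forward(self, context_feats, text_feats):
        return self.proj(torch.cat([context_feats, text_feats], dim=-1))

class ObjectInpaintingDiffusion(nn.Module):
    """Stage 2 (illustrative stand-in): a denoiser conditioned on the
    pre-inpainted semantics rather than on the raw text alone."""
    def __init__(self, dim=256):
        super().__init__()
        self.denoise = nn.Linear(dim * 2, dim)

    def forward(self, noisy_latent, semantics):
        return self.denoise(torch.cat([noisy_latent, semantics], dim=-1))

# cascaded call: semantics first, then diffusion conditioned on them
context, text = torch.randn(1, 256), torch.randn(1, 256)
semantics = SemanticInpainter()(context, text)
denoised = ObjectInpaintingDiffusion()(torch.randn(1, 256), semantics)
```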
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation, and especially in attribute-object binding, on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
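A hedged sketch of the focused cross-attention idea: a dependency parse links each attribute token to the noun it modifies, and the attribute's attention is then restricted to the pixels where that noun attends most strongly. The mapping, thresholding rule, and tensor layout are assumptions for illustration, not the paper's implementation.

```python
import torch

def focused_cross_attention(scores, attr_to_noun, keep_ratio=0.3):
    """scores: (pixels, num_tokens) pre-softmax cross-attention logits.
    attr_to_noun maps an attribute token index to the noun token it modifies
    (e.g. obtained from a dependency parse of the prompt). Each attribute
    token is restricted to the pixels where its noun attends most strongly."""
    probs = scores.softmax(dim=-1)
    biased = scores.clone()
    for attr, noun in attr_to_noun.items():
        noun_map = probs[:, noun]                          # noun's spatial support
        k = max(1, int(keep_ratio * noun_map.numel()))
        thresh = noun_map.topk(k).values.min()             # keep only the top-k pixels
        outside = noun_map < thresh
        biased[outside, attr] -= 1e4                       # block the attribute elsewhere
    return biased.softmax(dim=-1)

# toy prompt "a red car and a blue bench": token 1 ("red") modifies token 2 ("car")
attn = focused_cross_attention(torch.randn(64 * 64, 8), attr_to_noun={1: 2, 4: 5})
```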
- Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models [1.6450779686641077]
We introduce Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models.
We evaluate the optimized tokens within existing state-of-the-art Stable Diffusion extensions.
arXiv Detail & Related papers (2024-03-21T10:56:12Z)
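The sketch below shows one way (an assumption for illustration, not the OVAM implementation) to turn per-word cross-attention maps from a diffusion U-Net into a semantic segmentation: upsample each class word's map and take a per-pixel argmax. The token-optimization step described in the paper is omitted.

```python
import torch
import torch.nn.functional as F

def attention_maps_to_segmentation(attn_maps, class_token_ids, out_size=512):
    """attn_maps: (h, w, num_tokens) cross-attention probabilities collected
    from the diffusion U-Net. For each class word, take its attention map,
    upsample it to image resolution, and label each pixel with the argmax
    class; background is the complement of the strongest foreground score."""
    maps = attn_maps[:, :, class_token_ids].permute(2, 0, 1).unsqueeze(0)   # (1, C, h, w)
    maps = F.interpolate(maps, size=(out_size, out_size), mode="bilinear",
                         align_corners=False)[0]                            # (C, H, W)
    background = 1.0 - maps.max(dim=0, keepdim=True).values
    return torch.cat([background, maps], dim=0).argmax(dim=0)               # 0 = background

# toy: 16x16 attention over 10 tokens, class words at token positions 2 and 5
seg = attention_maps_to_segmentation(torch.rand(16, 16, 10).softmax(-1), [2, 5])
```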
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
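A minimal sketch of an adaptive mask derived from the current attention maps that re-weights each token's contribution; MaskDiffusion additionally conditions the mask on the prompt embeddings, which this illustration omits, so the re-weighting rule here is an assumption.

```python
import torch

def adaptive_mask_attention(attn, strength=2.0):
    """attn: (pixels, num_tokens) cross-attention probabilities from the
    current denoising step. Build a soft mask per token from its own
    normalized attention map, amplify each token's contribution where it is
    already dominant, and renormalize into a valid distribution."""
    per_token = attn / (attn.amax(dim=0, keepdim=True) + 1e-8)   # soft masks in [0, 1]
    reweighted = attn * (1.0 + strength * per_token)             # boost dominant regions
    return reweighted / reweighted.sum(dim=-1, keepdim=True)

# toy usage on a 32x32 latent with a 12-token prompt
attn = torch.rand(32 * 32, 12).softmax(dim=-1)
attn = adaptive_mask_attention(attn)
```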
- Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation [1.0152838128195465]
We propose a method for spatially controlling text-to-image generation without further training of diffusion models.
Our aim is to control the attention maps according to given semantic masks and text prompts.
arXiv Detail & Related papers (2023-08-11T09:15:22Z)
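One training-free way to steer attention toward user-given semantic masks is to penalize attention mass that falls outside each token's target region and take a gradient step on the latent. This is a generic guidance sketch under that assumption, not necessarily the paper's exact update rule.

```python
import torch

def attention_guidance_step(latent, get_attention, token_masks, step_size=0.1):
    """One guidance update. get_attention(latent) -> (pixels, num_tokens)
    cross-attention probabilities; token_masks maps a token index to its
    target binary mask over pixels. The latent is nudged so that each
    token's attention mass falls inside its given region."""
    latent = latent.detach().requires_grad_(True)
    attn = get_attention(latent)
    loss = sum((attn[:, t] * (1.0 - m)).sum() for t, m in token_masks.items())
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()

# toy differentiable stand-in for the U-Net's cross-attention read-out
proj = torch.nn.Linear(4, 8)
get_attn = lambda z: proj(z).softmax(dim=-1)          # (pixels, 8 tokens)
latent = torch.randn(64 * 64, 4)
mask = torch.zeros(64 * 64)
mask[:2048] = 1.0                                     # target region for token 3
latent = attention_guidance_step(latent, get_attn, {3: mask})
```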
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
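A toy sketch of the two pre-training objectives named above, combined into one loss: masked image modeling reconstructs hidden image patches and masked language modeling predicts hidden tokens. The encoder, heads, and masking policy are placeholder assumptions, not the StrucTexTv2 architecture.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MaskedVisualTextualPretraining(nn.Module):
    """Toy stand-in: one shared encoder with two self-supervised heads."""
    def __init__(self, dim=128, patch_dim=768, vocab=30522):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)        # placeholder document encoder
        self.mim_head = nn.Linear(dim, patch_dim) # masked image modeling head
        self.mlm_head = nn.Linear(dim, vocab)     # masked language modeling head

    def forward(self, feats, target_patches, target_tokens, masked_pos):
        h = self.encoder(feats)[:, masked_pos]                   # features at masked positions
        mim_loss = F.mse_loss(self.mim_head(h), target_patches)  # reconstruct image patches
        mlm_loss = F.cross_entropy(self.mlm_head(h).flatten(0, 1),
                                   target_tokens.flatten())      # predict masked tokens
        return mim_loss + mlm_loss

model = MaskedVisualTextualPretraining()
loss = model(torch.randn(2, 196, 128), torch.randn(2, 3, 768),
             torch.randint(0, 30522, (2, 3)), masked_pos=[5, 17, 42])
```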
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
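DiffEdit derives its editing mask by contrasting the denoiser's noise predictions under the source and the edit prompts. The sketch below illustrates that contrast-and-threshold idea with a stand-in denoiser; the noising schedule and threshold are illustrative assumptions.

```python
import torch

def diffedit_mask(image, eps_model, src_emb, tgt_emb, n_samples=8, quantile=0.8):
    """Estimate where the edit prompt disagrees with the source prompt:
    average |eps(x_t, target) - eps(x_t, source)| over several noise draws,
    normalize, and threshold into a binary editing mask."""
    diffs = []
    for _ in range(n_samples):
        x_t = image + 0.3 * torch.randn_like(image)              # illustrative noising
        diffs.append((eps_model(x_t, tgt_emb) - eps_model(x_t, src_emb)).abs())
    d = torch.stack(diffs).mean(dim=0).mean(dim=0)               # average over draws, channels
    d = d / (d.max() + 1e-8)
    return (d > torch.quantile(d, quantile)).float()             # 1 = region to edit

# toy stand-in denoiser whose prediction depends on the prompt embedding
eps = lambda x, emb: x * emb.mean()
mask = diffedit_mask(torch.randn(4, 64, 64), eps,
                     torch.randn(77, 768), torch.randn(77, 768))
```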
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.