Masked-Attention Diffusion Guidance for Spatially Controlling
Text-to-Image Generation
- URL: http://arxiv.org/abs/2308.06027v2
- Date: Mon, 30 Oct 2023 04:48:26 GMT
- Title: Masked-Attention Diffusion Guidance for Spatially Controlling
Text-to-Image Generation
- Authors: Yuki Endo
- Abstract summary: We propose a method for spatially controlling text-to-image generation without further training of diffusion models.
Our aim is to control the attention maps according to given semantic masks and text prompts.
- Score: 1.0152838128195465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis has achieved high-quality results with recent
advances in diffusion models. However, text input alone has high spatial
ambiguity and limited user controllability. Most existing methods allow spatial
control through additional visual guidance (e.g., sketches and semantic masks)
but require additional training with annotated images. In this paper, we
propose a method for spatially controlling text-to-image generation without
further training of diffusion models. Our method is based on the insight that
the cross-attention maps reflect the positional relationship between words and
pixels. Our aim is to control the attention maps according to given semantic
masks and text prompts. To this end, we first explore a simple approach of
directly swapping the cross-attention maps with constant maps computed from the
semantic regions. Some prior works also allow training-free spatial control of
text-to-image diffusion models by directly manipulating cross-attention maps.
However, these approaches still suffer from misalignment with the given masks
because the manipulated attention maps deviate from those the diffusion models
actually learned. To address this issue, we propose masked-attention guidance, which can
generate images more faithful to semantic masks via indirect control of
attention to each word and pixel by manipulating noise images fed to diffusion
models. Masked-attention guidance can be easily integrated into pre-trained
off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the task
of text-guided image editing. Experiments show that our method enables more
accurate spatial control than baselines, both qualitatively and quantitatively.
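The abstract describes masked-attention guidance only at a high level: a loss compares the cross-attention maps with the given semantic masks, and its gradient is used to update the noisy latent, rather than overwriting the maps directly. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the helper `attn_fn` (a hook-based function returning per-token attention maps), the mask format, and the `scale` parameter are all assumptions introduced for this example.
```python
# Minimal, illustrative sketch of masked-attention guidance (not the paper's code).
# Assumes a hypothetical `attn_fn(latent, t, text_emb)` that runs the denoising
# U-Net and returns per-token cross-attention maps captured via forward hooks.
import torch
import torch.nn.functional as F


def masked_attention_loss(attn_maps, masks):
    """Penalize cross-attention mass that falls outside each token's mask.

    attn_maps : dict {token_idx: (H, W) tensor of attention probabilities}
    masks     : dict {token_idx: (H, W) binary tensor, 1 inside the region}
    """
    loss = 0.0
    for idx, attn in attn_maps.items():
        mask = F.interpolate(masks[idx][None, None].float(),
                             size=attn.shape, mode="nearest")[0, 0]
        # Fraction of this token's attention that leaks outside its region.
        loss = loss + ((1.0 - mask) * attn).sum() / (attn.sum() + 1e-8)
    return loss


def guided_step(latent, t, text_emb, masks, attn_fn, scale=1.0):
    """Nudge the noisy latent so attention better matches the masks.

    The attention maps themselves are never overwritten; guidance acts
    indirectly through the gradient with respect to the noise image.
    """
    latent = latent.detach().requires_grad_(True)
    loss = masked_attention_loss(attn_fn(latent, t, text_emb), masks)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - scale * grad).detach()
```
Because the update acts on the noise image, the attention maps the model subsequently produces are still ones it has actually learned, which is the property the abstract contrasts with direct attention-map swapping.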
Related papers
- Scribble-Guided Diffusion for Training-free Text-to-Image Generation [17.930032337081673]
Scribble-Guided Diffusion (ScribbleDiff) is a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation.
We introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs.
arXiv Detail & Related papers (2024-09-12T13:13:07Z)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
- NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models [6.254873489691852]
We propose an unsupervised method to discover latent semantics in text-to-image diffusion models without relying on text prompts.
Our method achieves highly disentangled edits, outperforming existing diffusion-based and GAN-based latent space editing methods.
arXiv Detail & Related papers (2023-12-08T22:04:53Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition.
Although promising, unsupervised learning on diffusion-generated images remains underexplored.
We introduce customized solutions that fully exploit the free attention masks available in diffusion models.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models [8.250234707160793]
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts.
However, they often fail to semantically align the generated images with the prompts due to limited compositional capabilities.
We propose a novel attention mask control strategy based on predicted object boxes to address these issues.
arXiv Detail & Related papers (2023-05-23T10:49:22Z)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- Directed Diffusion: Direct Control of Object Placement through Attention Guidance [15.275386705641266]
Text-guided diffusion models can generate an effectively endless variety of images given only a short text prompt describing the desired image content.
These models often struggle to compose scenes containing several key objects such as characters in specified positional relationships.
In this work, we take a particularly straightforward approach to providing the needed direction.
arXiv Detail & Related papers (2023-02-25T20:48:15Z)
- Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We focus in particular on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.