Harnessing the Spatial-Temporal Attention of Diffusion Models for
High-Fidelity Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2304.03869v1
- Date: Fri, 7 Apr 2023 23:49:34 GMT
- Title: Harnessing the Spatial-Temporal Attention of Diffusion Models for
High-Fidelity Text-to-Image Synthesis
- Authors: Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang,
Shiyu Chang
- Abstract summary: Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
- Score: 59.10787643285506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based models have achieved state-of-the-art performance on
text-to-image synthesis tasks. However, one critical limitation of these models
is the low fidelity of generated images with respect to the text description,
such as missing objects, mismatched attributes, and mislocated objects. One key
reason for such inconsistencies is the inaccurate cross-attention to text in
both the spatial dimension, which controls at what pixel region an object
should appear, and the temporal dimension, which controls how different levels
of details are added through the denoising steps. In this paper, we propose a
new text-to-image algorithm that adds explicit control over spatial-temporal
cross-attention in diffusion models. We first utilize a layout predictor to
predict the pixel regions for objects mentioned in the text. We then impose
spatial attention control by combining the attention over the entire text
description and that over the local description of the particular object in the
corresponding pixel region of that object. The temporal attention control is
further added by allowing the combination weights to change at each denoising
step, and the combination weights are optimized to ensure high fidelity between
the image and the text. Experiments show that our method generates images with
higher fidelity compared to diffusion-model-based baselines without fine-tuning
the diffusion model. Our code is publicly available at
https://github.com/UCSB-NLP-Chang/Diffusion-SpaceTime-Attn.
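
A minimal sketch of the spatial-temporal attention combination described in the abstract, assuming PyTorch and a cross-attention layer whose output is a (batch, pixels, channels) tensor. The names combine_object_attention, region_mask, and w_t are illustrative assumptions, not the authors' released API (see the linked repository for the actual implementation); w_t stands in for the per-step combination weight that the paper optimizes for image-text fidelity.

import torch

def combine_object_attention(attn_global: torch.Tensor,
                             attn_local: torch.Tensor,
                             region_mask: torch.Tensor,
                             w_t: float) -> torch.Tensor:
    """Blend two cross-attention outputs inside an object's predicted pixel region.

    attn_global: (B, HW, C) attention output conditioned on the entire text description.
    attn_local:  (B, HW, C) attention output conditioned on the object's local description.
    region_mask: (1, HW, 1) binary mask for the object's pixel region (from a layout predictor).
    w_t:         combination weight for the current denoising step t (hypothetical placeholder).
    """
    blended = w_t * attn_local + (1.0 - w_t) * attn_global
    # Outside the object's region, keep the attention over the full prompt unchanged.
    return region_mask * blended + (1.0 - region_mask) * attn_global

if __name__ == "__main__":
    B, HW, C = 1, 64 * 64, 320
    attn_global = torch.randn(B, HW, C)
    attn_local = torch.randn(B, HW, C)
    region_mask = torch.zeros(1, HW, 1)
    region_mask[:, : HW // 4] = 1.0   # toy region standing in for a predicted layout
    out = combine_object_attention(attn_global, attn_local, region_mask, w_t=0.6)
    print(out.shape)  # torch.Size([1, 4096, 320])

In the paper's formulation, one such blend is applied per object region, and the combination weights are allowed to change and are optimized at each denoising step rather than fixed as in this toy example.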
Related papers
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- AID: Attention Interpolation of Text-to-Image Diffusion [64.87754163416241]
We introduce a training-free technique named Attention Interpolation via Diffusion (AID).
AID fuses the interpolated attention with self-attention to boost fidelity.
We also present a variant, Conditional-guided Attention Interpolation via Diffusion, that considers interpolation as a condition-dependent generative process.
arXiv Detail & Related papers (2024-03-26T17:57:05Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experimental results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
- Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation [1.0152838128195465]
We propose a method for spatially controlling text-to-image generation without further training of diffusion models.
Our aim is to control the attention maps according to given semantic masks and text prompts.
arXiv Detail & Related papers (2023-08-11T09:15:22Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)