Directed Diffusion: Direct Control of Object Placement through Attention
Guidance
- URL: http://arxiv.org/abs/2302.13153v3
- Date: Tue, 26 Sep 2023 12:22:09 GMT
- Title: Directed Diffusion: Direct Control of Object Placement through Attention
Guidance
- Authors: Wan-Duo Kurt Ma, J.P. Lewis, Avisek Lahiri, Thomas Leung, W. Bastiaan
Kleijn
- Abstract summary: Text-guided diffusion models can generate an effectively endless variety of images given only a short text prompt describing the desired image content.
These models often struggle to compose scenes containing several key objects such as characters in specified positional relationships.
In this work, we take a particularly straightforward approach to providing the needed direction.
- Score: 15.275386705641266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable
Diffusion are able to generate an effectively endless variety of images given
only a short text prompt describing the desired image content. In many cases
the images are of very high quality. However, these models often struggle to
compose scenes containing several key objects such as characters in specified
positional relationships. The missing capability to "direct" the placement of
characters and objects both within and across images is crucial in
storytelling, as recognized in the literature on film and animation theory. In
this work, we take a particularly straightforward approach to providing the
needed direction. Drawing on the observation that the cross-attention maps for
prompt words reflect the spatial layout of objects denoted by those words, we
introduce an optimization objective that produces "activation" at desired
positions in these cross-attention maps. The resulting approach is a step
toward generalizing the applicability of text-guided diffusion models beyond
single images to collections of related images, as in storybooks. Directed
Diffusion provides easy high-level positional control over multiple objects,
while making use of an existing pre-trained model and maintaining a coherent
blend between the positioned objects and the background. Moreover, it requires
only a few lines to implement.
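To make the mechanism described above concrete, here is a minimal, self-contained sketch of the kind of objective the abstract refers to: a loss that rewards cross-attention mass falling inside a user-chosen region for a given prompt token. The tensor sizes, the exact loss form, the box coordinates, and the toy optimization loop are illustrative assumptions rather than the paper's implementation; in Directed Diffusion such guidance acts on the cross-attention maps of a pre-trained Stable Diffusion model during denoising.

```python
# Hedged sketch: encourage a (toy) cross-attention map to place its
# "activation" inside a target box. All shapes and hyperparameters are
# illustrative assumptions, not the authors' code.
import torch

def region_mask(h, w, box):
    """Binary (h, w) mask for a box given as (x0, y0, x1, y1) in [0, 1] coordinates."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def placement_loss(attn, mask):
    """Fraction of attention mass that falls outside the desired region.

    attn: (h, w) non-negative cross-attention map for one prompt token.
    mask: (h, w) binary mask marking where activation is desired.
    """
    inside = (attn * mask).sum()
    return 1.0 - inside / (attn.sum() + 1e-8)  # 0 when all mass lies inside the box

# Toy demonstration: a synthetic 16x16 "attention map" derived from a latent-like
# tensor is pushed by gradient descent to concentrate its activation in the box.
latent = torch.randn(16, 16, requires_grad=True)
target = region_mask(16, 16, box=(0.1, 0.5, 0.5, 0.9))  # a region in the lower left
optimizer = torch.optim.SGD([latent], lr=0.5)

for _ in range(50):
    attn_map = torch.softmax(latent.flatten(), dim=0).view(16, 16)
    loss = placement_loss(attn_map, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"outside-mass after optimization: {loss.item():.3f}")
```

In an actual pipeline the analogous loss would be evaluated on the UNet's cross-attention maps for the prompt tokens naming each object, and its gradient used to steer the latent (or the attention maps themselves) toward the requested layout.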
Related papers
- Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding [39.73180294057053]
We propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features.
We also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement.
arXiv Detail & Related papers (2024-09-12T17:48:22Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing [47.71851180196975]
Tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers.
We conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information.
In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image.
arXiv Detail & Related papers (2024-03-06T03:32:56Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by an open vocabulary, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation [1.0152838128195465]
We propose a method for spatially controlling text-to-image generation without further training of diffusion models.
Our aim is to control the attention maps according to given semantic masks and text prompts (a generic sketch of this kind of mask-based attention control appears after this list).
arXiv Detail & Related papers (2023-08-11T09:15:22Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more practical setting: generating a realistic image from a given set of objects and captions.
In this setting, the objects explicitly define the critical roles in the target image, while the captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
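Several of the training-free methods listed above (for example the masked-attention guidance and zero-shot layout-conditioning entries) share a common building block: the cross-attention of a chosen prompt token is re-weighted by a user-supplied spatial mask. The sketch below is a generic illustration of that building block under assumed shapes and an assumed additive-bias scheme; it is not the implementation of any specific paper in this list.

```python
# Hedged sketch of mask-based cross-attention control: boost the attention
# logits of one prompt token inside a semantic mask and suppress them outside.
# Shapes, the boost strength, and the token index are illustrative assumptions.
import torch

def masked_cross_attention(queries, keys, values, token_idx, mask, boost=4.0):
    """Cross-attention that steers one token toward a masked region.

    queries: (n_pixels, d)  image-side features (a flattened latent grid)
    keys:    (n_tokens, d)  text-side features, one row per prompt token
    values:  (n_tokens, d)
    token_idx: index of the prompt token to steer
    mask:    (n_pixels,)    1 inside the desired region, 0 outside
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / d ** 0.5                 # (n_pixels, n_tokens)
    # Additive bias: raise the token's logits inside the mask, lower them outside.
    logits[:, token_idx] += boost * (2.0 * mask - 1.0)
    attn = torch.softmax(logits, dim=-1)                 # rows sum to 1 over tokens
    return attn @ values                                 # (n_pixels, d)

# Toy usage on an 8x8 "latent" grid with a 5-token prompt.
h = w = 8
queries = torch.randn(h * w, 64)
keys, values = torch.randn(5, 64), torch.randn(5, 64)
mask = torch.zeros(h, w)
mask[:, : w // 2] = 1.0                                  # steer the token to the left half
out = masked_cross_attention(queries, keys, values, token_idx=2, mask=mask.flatten())
print(out.shape)  # torch.Size([64, 64])
```

In practice such a bias (or an equivalent gradient-based guidance term) is applied inside selected cross-attention layers of the diffusion UNet at selected timesteps, rather than to a standalone attention call as here.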
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.