Related papers: Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

URL: http://arxiv.org/abs/2403.16990v1
Date: Mon, 25 Mar 2024 17:52:07 GMT
Title: Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
Authors: Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or,
Abstract summary: We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
Score: 60.943159830780154
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.

Related papers

In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation [41.79836820271156]
"In-Context Brush" is a zero-shot framework for customized subject insertion.<n>We formulate the object image and the textual prompts as cross-modal demonstrations.<n>The goal is to inpaint the target image with the subject aligning textual prompts without model tuning.
arXiv Detail & Related papers (2025-05-26T17:49:10Z)
Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z)
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks. We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens. We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering [13.785484396436367]
We formulate image composition as a subject-based local editing task, solely focusing on foreground generation. We propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-03-08T04:58:49Z)
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images. Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept. We propose the DETEX, a novel approach that learns the disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. We introduce customized solutions by fully exploiting the aforementioned free attention masks.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.