A-STAR: Test-time Attention Segregation and Retention for Text-to-image
Synthesis
- URL: http://arxiv.org/abs/2306.14544v1
- Date: Mon, 26 Jun 2023 09:34:10 GMT
- Title: A-STAR: Test-time Attention Segregation and Retention for Text-to-image
Synthesis
- Authors: Aishwarya Agarwal and Srikrishna Karanam and K J Joseph and Apoorv
Saxena and Koustava Goswami and Balaji Vasan Srinivasan
- Abstract summary: We present two test-time attention-based loss functions for text-to-image generative models.
First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt.
Second, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps.
- Score: 24.159726798004748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent developments in text-to-image generative models have led to a
suite of high-performing methods capable of producing creative imagery from
free-form text, these models still have several limitations. By analyzing the cross-attention
representations of these models, we notice two key issues. First, for text
prompts that contain multiple concepts, there is a significant amount of
pixel-space overlap (i.e., same spatial regions) among pairs of different
concepts. This eventually leads to the model being unable to distinguish
between the two concepts and one of them being ignored in the final generation.
Next, while these models attempt to capture all such concepts during the
beginning of denoising (e.g., first few steps) as evidenced by cross-attention
maps, this knowledge is not retained by the end of denoising (e.g., last few
steps). Such loss of knowledge eventually leads to inaccurate generation
outputs. To address these issues, our key innovations include two test-time
attention-based loss functions that substantially improve the performance of
pretrained baseline text-to-image diffusion models. First, our attention
segregation loss reduces the cross-attention overlap between attention maps of
different concepts in the text prompt, thereby reducing the confusion/conflict
among various concepts and the eventual capture of all concepts in the
generated output. Next, our attention retention loss explicitly forces
text-to-image diffusion models to retain cross-attention information for all
concepts across all denoising time steps, thereby leading to reduced
information loss and the preservation of all concepts in the generated output.
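
To make the two losses concrete, below is a minimal PyTorch-style sketch, assuming per-concept cross-attention maps have already been extracted from the denoising network and normalized. The soft-IoU overlap term, the peak-activation retention term, the `guide_latent` update, and all function names are illustrative assumptions rather than the paper's exact formulation; in the full method, such losses would be evaluated at each denoising step and used to adjust the noisy latent before sampling continues.

```python
import torch


def attention_segregation_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Penalize spatial overlap between cross-attention maps of different concepts.

    attn_maps: (K, H, W) tensor, one normalized attention map per concept token.
    """
    num_concepts = attn_maps.shape[0]
    loss = attn_maps.new_zeros(())
    for i in range(num_concepts):
        for j in range(i + 1, num_concepts):
            intersection = torch.minimum(attn_maps[i], attn_maps[j]).sum()
            union = torch.maximum(attn_maps[i], attn_maps[j]).sum()
            loss = loss + intersection / (union + 1e-8)  # soft IoU of the two maps
    return loss


def attention_retention_loss(attn_maps: torch.Tensor,
                             ref_maps: torch.Tensor) -> torch.Tensor:
    """Penalize concepts whose attention has decayed relative to an earlier step.

    ref_maps: attention maps saved from an early denoising step, same shape as attn_maps.
    """
    current_peak = attn_maps.flatten(1).max(dim=1).values    # (K,)
    reference_peak = ref_maps.flatten(1).max(dim=1).values   # (K,)
    return torch.clamp(reference_peak - current_peak, min=0.0).sum()


def guide_latent(latent, attn_maps, ref_maps, step_size=0.1, w_seg=1.0, w_ret=1.0):
    """One illustrative test-time update: nudge the noisy latent to lower both losses.

    Assumes attn_maps were computed from `latent` with gradients enabled, so the
    combined loss is differentiable with respect to the latent.
    """
    loss = (w_seg * attention_segregation_loss(attn_maps)
            + w_ret * attention_retention_loss(attn_maps, ref_maps))
    grad = torch.autograd.grad(loss, latent)[0]
    return latent - step_size * grad


if __name__ == "__main__":
    # Dummy maps for three concepts on a 16x16 attention grid.
    maps = torch.rand(3, 16, 16)
    early_maps = torch.rand(3, 16, 16)
    print(attention_segregation_loss(maps).item())
    print(attention_retention_loss(maps, early_maps).item())
```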
Related papers
- CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis [8.386261591495103]
We introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps.
Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented.
arXiv Detail & Related papers (2024-11-25T08:20:14Z)
- Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion [50.26583654615212]
Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models for new tasks with minimal data.
In this study, we identify and categorize the catastrophic forgetting problem into two types: relevant-concept forgetting and previous-concept forgetting.
Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones.
arXiv Detail & Related papers (2024-11-08T12:58:48Z)
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
- AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [4.544788024283586]
AttenCraft is an attention-guided method for multiple concept disentanglement.
We introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts.
Our method outperforms baseline models in terms of image alignment and performs comparably on text alignment.
arXiv Detail & Related papers (2024-05-28T08:50:14Z)
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
- Attention Calibration for Disentangled Text-to-Image Personalization [12.339742346826403]
We propose an attention calibration mechanism to improve the concept-level understanding of the T2I model.
We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-03-27T13:31:39Z)
- Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks.
We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens.
We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.