Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models
- URL: http://arxiv.org/abs/2406.07844v1
- Date: Wed, 12 Jun 2024 03:21:34 GMT
- Title: Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models
- Authors: Arman Zarei, Keivan Rezaei, Samyadeep Basu, Mehrdad Saberi, Mazda Moayeri, Priyatham Kattakinda, Soheil Feizi
- Abstract summary: We show that imperfect text conditioning with CLIP text-encoder is one of the primary reasons behind the inability of text-to-image models to generate high-fidelity compositional scenes.
Our main finding shows that the best compositional improvements can be achieved without harming the model's FID scores.
- Score: 46.723653095494896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image diffusion-based generative models have the stunning ability to generate highly detailed and photo-realistic images and achieve state-of-the-art low FID scores on challenging image generation benchmarks. However, one of the primary failure modes of these text-to-image generative models is in composing attributes, objects, and their associated relationships accurately into an image. In our paper, we investigate this compositionality-based failure mode and highlight that imperfect text conditioning with the CLIP text-encoder is one of the primary reasons behind the inability of these models to generate high-fidelity compositional scenes. In particular, we show that (i) there exists an optimal text-embedding space that can generate highly coherent compositional scenes, which shows that the output space of the CLIP text-encoder is sub-optimal, and (ii) the final token embeddings in CLIP are erroneous, as they often include attention contributions from unrelated tokens in compositional prompts. Our main finding is that the best compositional improvements can be achieved (without harming the model's FID scores) by fine-tuning only a simple linear projection on CLIP's representation space in Stable-Diffusion variants, using a small set of compositional image-text pairs. This result demonstrates that the sub-optimality of CLIP's output space is a major error source. We also show that re-weighting the erroneous attention contributions in CLIP can lead to improved compositional performance; however, these improvements are often less significant than those achieved by learning a linear projection head alone, indicating that erroneous attentions are only a minor error source.
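The headline recipe lends itself to a short sketch. Below is a minimal PyTorch rendition of fine-tuning only a linear projection on CLIP's text-embedding space, assuming a Stable Diffusion v1.5 checkpoint loaded via Hugging Face transformers; the identity initialization, learning rate, and all names here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import CLIPTextModel, CLIPTokenizer

# Frozen CLIP text encoder as used by Stable Diffusion v1.x (checkpoint
# name is an assumption; swap in whichever SD variant is targeted).
repo = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
text_encoder.requires_grad_(False)

# The only trainable component: a linear map on CLIP's output space,
# started at the identity so training begins from the unmodified encoder.
dim = text_encoder.config.hidden_size  # 768 for SD v1.x
projection = nn.Linear(dim, dim)
nn.init.eye_(projection.weight)
nn.init.zeros_(projection.bias)

def encode_prompt(prompt: str) -> torch.Tensor:
    """Token-wise CLIP embeddings passed through the learned projection,
    ready to feed the UNet's cross-attention layers."""
    tokens = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = text_encoder(tokens.input_ids)[0]  # (1, 77, dim)
    return projection(hidden)

# Training would run the usual diffusion denoising loss on a small set of
# compositional image-text pairs, back-propagating into `projection` only.
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)
```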
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
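From this description, the core object appears to be a similarity profile between an image and a bank of neighbor texts. A minimal sketch under that reading, where the function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def coder_representation(image_embs: torch.Tensor,
                         text_bank_embs: torch.Tensor) -> torch.Tensor:
    """Re-encode each image as its cosine-similarity profile against a
    bank of neighbor texts: (n_images, d) x (n_texts, d) -> (n_images, n_texts)."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_bank_embs, dim=-1)
    return img @ txt.T

# Classification can then compare neighbor profiles rather than raw CLIP
# features, e.g. matching an image's profile against per-class profiles.
```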
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion
Models [58.46926334842161]
This work illuminates the fundamental reasons for text-image misalignment, pinpointing low attention activation scores and overlapping object masks.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
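Plausible forms of the two objectives, inferred only from the summary above (the paper's exact formulations may differ); the inputs are assumed to be per-object cross-attention maps from the diffusion UNet:

```python
import torch

def separate_loss(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """Penalize spatial overlap between two objects' cross-attention maps
    (each an (H, W) tensor of non-negative weights)."""
    overlap = (attn_a * attn_b).sum()
    return overlap / (attn_a.sum() + attn_b.sum() + 1e-8)

def enhance_loss(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """Push each object token's peak attention activation toward 1."""
    return torch.stack([1.0 - a.max() for a in attn_maps]).mean()
```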
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels at producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z) - Improving Compositional Text-to-image Generation with Large
Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
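One way such an LVLM-based assessment loop could look; the evaluation dimensions, prompt wording, and the generic `lvlm` callable are all assumptions for illustration, not the paper's interface:

```python
EVAL_PROMPT = (
    'Rate how well the image matches the caption "{caption}" on three '
    "dimensions: (1) objects present, (2) attribute binding, "
    "(3) spatial relations. Reply with three integers from 0 to 10."
)

def lvlm_alignment_scores(lvlm, image, caption: str) -> list[int]:
    """`lvlm` is any vision-language model callable that takes an image
    plus a text prompt and returns a text reply (hypothetical interface)."""
    reply = lvlm(image=image, prompt=EVAL_PROMPT.format(caption=caption))
    return [int(tok) for tok in reply.split()[:3]]

# The per-dimension scores can then be averaged into a reward used to
# pick or refine generations, e.g. regenerating when any score is low.
```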
arXiv Detail & Related papers (2023-10-10T05:09:05Z) - Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z) - Text encoders bottleneck compositionality in contrastive vision-language
models [76.2406963762722]
We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
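A toy version of such a recovery probe: a small decoder trained with token-level cross-entropy to reconstruct the caption from one pooled text vector. The architecture and sizes below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class RecoveryProbe(nn.Module):
    """Decode caption tokens back out of a single pooled text embedding."""
    def __init__(self, emb_dim: int = 512, vocab: int = 49408, hidden: int = 512):
        super().__init__()
        self.inp = nn.Linear(emb_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, pooled: torch.Tensor, seq_len: int = 77) -> torch.Tensor:
        # Condition every decoding step on the single-vector representation
        # (a deliberately simple choice; any seq2seq decoder would do).
        h = self.inp(pooled).unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.decoder(h)
        return self.head(out)  # (batch, seq_len, vocab) token logits

# Markedly worse reconstruction on compositional captions than on simple
# ones is the signal that the single vector has lost compositional detail.
```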
arXiv Detail & Related papers (2023-05-24T08:48:44Z) - ComCLIP: Training-Free Compositional Image and Text Matching [19.373706257771673]
Contrastive Language-Image Pretraining has demonstrated great zero-shot performance for matching images and text.
We propose a novel training-free compositional CLIP model (ComCLIP).
ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings.
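A schematic of that matching step as described; the equal weighting and the subject/object/action keys are assumptions:

```python
import torch.nn.functional as F

def comclip_score(full_img_emb, full_txt_emb,
                  sub_img_embs: dict, sub_txt_embs: dict):
    """Combine the global CLIP image-text similarity with per-part matches
    between sub-images (e.g. 'subject', 'object', 'action' crops) and the
    corresponding text spans; all inputs are CLIP embedding tensors."""
    score = F.cosine_similarity(full_img_emb, full_txt_emb, dim=-1)
    for part, emb in sub_img_embs.items():
        if part in sub_txt_embs:
            score = score + F.cosine_similarity(emb, sub_txt_embs[part], dim=-1)
    return score
```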
arXiv Detail & Related papers (2022-11-25T01:37:48Z) - Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z) - No Token Left Behind: Explainability-Aided Image Classification and
Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
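One plausible reading of that loss term, operating on a per-token relevance map from an attention-explainability method and raising the floor of attention coverage; the paper's exact formulation may differ:

```python
import torch

def coverage_loss(token_relevance: torch.Tensor) -> torch.Tensor:
    """token_relevance: (n_tokens,) relevance mass that an explainability
    method assigns to each semantic prompt token. Minimizing the negative
    minimum raises the floor, so no relevant token is left unattended."""
    return -token_relevance.min()
```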
arXiv Detail & Related papers (2022-04-11T07:16:39Z)