Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
- URL: http://arxiv.org/abs/2504.01137v1
- Date: Tue, 01 Apr 2025 19:17:44 GMT
- Title: Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
- Authors: Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz
- Abstract summary: Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations.
- Score: 35.85433370296494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component on a subset of contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in "San Francisco's Golden Gate Bridge", the token "gate" alone captures the full expression. We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item's representation after the textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.
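The leakage mitigation described in the abstract boils down to a single embedding swap after text encoding. Below is a minimal sketch of that idea using a Stable-Diffusion-style pipeline from the diffusers library; the checkpoint name, the example prompt, and the assumption that the swap happens on the text encoder's final-layer outputs are illustrative guesses, not the authors' released code.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; any Stable-Diffusion-style pipeline exposing
# `prompt_embeds` should work the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def encode(text: str) -> torch.Tensor:
    """Return the per-token contextual embeddings produced by the text encoder."""
    ids = pipe.tokenizer(
        text,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    with torch.no_grad():
        return pipe.text_encoder(ids)[0]  # (1, 77, hidden)


# Hypothetical prompt where features of "lobster" might leak onto the crab.
prompt = "a crab and a lobster on the beach"
leaked_word = "lobster"

contextual = encode(prompt)        # contextual representations of the full prompt
standalone = encode(leaked_word)   # uncontextualized representation of the item alone

# Locate the leaked word inside the tokenized prompt (+1 accounts for the BOS token).
prompt_tokens = pipe.tokenizer.tokenize(prompt)
pos = prompt_tokens.index(pipe.tokenizer.tokenize(leaked_word)[0]) + 1

# Core idea: overwrite the leaked item's contextual embedding with its
# uncontextualized one (index 1 is the first token after BOS in the standalone encoding).
contextual[0, pos] = standalone[0, 1]

image = pipe(prompt_embeds=contextual, num_inference_steps=30).images[0]
image.save("leakage_mitigated.png")
```

The same `prompt_embeds` hook can, in principle, serve the redundancy experiment from the abstract: drop the rows of `contextual` corresponding to redundant tokens after encoding and generate from the remaining representation.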
Related papers
- Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models [64.52046218688295]
Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. We conduct the first in-depth analysis of the role padding tokens play in T2I models. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored.
arXiv Detail & Related papers (2025-01-12T08:36:38Z) - DEADiff: An Efficient Stylization Diffusion Model with Disentangled
Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using the following two strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z) - Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines [33.49257838597258]
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process.
We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations.
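A hedged sketch of what such an intermediate-representation probe could look like with a diffusers pipeline; the checkpoint, the layer index, and the re-application of the encoder's final layer norm are assumptions for illustration, not the Diffusion Lens authors' code.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

ids = pipe.tokenizer(
    "an apple and a glass of orange juice",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to(pipe.device)

with torch.no_grad():
    out = pipe.text_encoder(ids, output_hidden_states=True)

layer = 6  # illustrative intermediate layer; the analysis would sweep over layers
hidden = out.hidden_states[layer]
# Re-apply the encoder's final layer norm so the statistics roughly match what
# the diffusion model saw during training (an assumption about the recipe).
hidden = pipe.text_encoder.text_model.final_layer_norm(hidden)

image = pipe(prompt_embeds=hidden, num_inference_steps=30).images[0]
image.save("diffusion_lens_layer6.png")
```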
arXiv Detail & Related papers (2024-03-09T09:11:49Z) - Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion
Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
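The guidance rule itself can be sketched in a few lines; the function name, the guidance scale, and the exact combination of the three noise predictions are assumptions about the method rather than its published implementation.

```python
import torch

def contrastive_guidance(eps_uncond: torch.Tensor,
                         eps_pos: torch.Tensor,
                         eps_neg: torch.Tensor,
                         scale: float = 7.5) -> torch.Tensor:
    """Classifier-free-guidance-style update that steers denoising along the
    direction separating two prompts that differ only in the intended factor."""
    # eps_* are the UNet noise predictions for the unconditional input and for
    # the two minimally different prompts (e.g. "a photo of a cat" vs "a photo of a dog").
    return eps_uncond + scale * (eps_pos - eps_neg)
```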
arXiv Detail & Related papers (2024-02-21T03:01:17Z) - Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models [9.514940899499752]
Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images.
However, when it comes to text-based image generation, they often fail to capture the intended meaning presented in the text.
We propose Predicated Diffusion, a unified framework to express users' intentions.
arXiv Detail & Related papers (2023-10-03T15:45:50Z) - The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model.
We find surprising visual connections between concepts that transcend their textual semantics.
We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [102.88033622546251]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - What Are You Token About? Dense Retrieval as Distributions Over the
Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connection between them and sparse retrieval.
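A rough sketch of projecting an encoder vector onto the vocabulary via a masked-language-modeling head; using bert-base-uncased as a stand-in for a trained dual encoder and CLS pooling for the query vector are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tok("what is the capital of france", return_tensors="pt")
with torch.no_grad():
    hidden = model.bert(**inputs).last_hidden_state  # (1, seq_len, hidden)
query_vec = hidden[:, :1]                            # CLS pooling, common in dual encoders

# Reuse the MLM head to map the dense query vector to a distribution over the vocabulary.
with torch.no_grad():
    logits = model.cls(query_vec)                    # (1, 1, vocab_size)
top = torch.topk(logits[0, 0].softmax(-1), k=10)
print(tok.convert_ids_to_tokens(top.indices.tolist()))
```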
arXiv Detail & Related papers (2022-12-20T16:03:25Z) - Text-Guided Neural Image Inpainting [20.551488941041256]
The inpainting task requires filling a corrupted image with content coherent with its context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
arXiv Detail & Related papers (2020-04-07T09:04:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.