Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
- URL: http://arxiv.org/abs/2504.01137v2
- Date: Wed, 13 Aug 2025 08:52:03 GMT
- Title: Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
- Authors: Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz
- Abstract summary: We investigate how semantic information is distributed across token representations in a text-to-image model. We find information is usually concentrated in only one or two of the item's tokens. In some cases, items do influence each other's representation, often leading to misinterpretations.
- Score: 35.85433370296494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) models generate images by encoding text prompts into token representations, which then guide the diffusion process. While prior work has largely focused on improving alignment by refining the diffusion process, we focus on the textual encoding stage. Specifically, we investigate how semantic information is distributed across token representations within and between lexical items (i.e., words or expressions conveying a single concept) in the prompt. We analyze information flow at two levels: (1) in-item representation: whether individual tokens represent their lexical item, and (2) cross-item interaction: whether information flows across the tokens of different lexical items. We use patching techniques to uncover surprising encoding patterns. We find that information is usually concentrated in only one or two of the item's tokens. For example, in the item "San Francisco's Golden Gate Bridge", the token "Gate" sufficiently captures the entire expression, while the other tokens could effectively be discarded. Lexical items also tend to remain isolated; for instance, the token "dog" encodes no visual information about "green" in the prompt "a green dog". However, in some cases items do influence each other's representation, often leading to misinterpretations; e.g., in the prompt "a pool by a table", the token "pool" represents a pool table after contextualization. Our findings highlight the critical role of token-level encoding in image generation, suggesting that misalignment issues may originate already during the textual encoding.
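The patching methodology described in the abstract can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the toy `encode` function stands in for a real T2I text encoder (e.g., CLIP's), and all prompts, positions, and helper names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding width

def encode(tokens):
    """Toy stand-in for a T2I text encoder: one vector per token."""
    return [rng.normal(size=d) for _ in tokens]

src_tokens = ["san", "francisco", "'s", "golden", "gate", "bridge"]
src_reprs = encode(src_tokens)

# A neutral target prompt whose last slot we will overwrite.
tgt_tokens = ["a", "photo", "of", "a", "thing"]
tgt_reprs = encode(tgt_tokens)

def patch(reprs, position, source_vec):
    """Activation patching: swap one token's representation into a prompt."""
    patched = [v.copy() for v in reprs]
    patched[position] = source_vec.copy()
    return patched

# Patch only the "gate" representation into the target prompt; if the image
# generated from the patched encoding still shows the Golden Gate Bridge,
# that single token carries the whole expression.
gate_vec = src_reprs[src_tokens.index("gate")]
patched = patch(tgt_reprs, len(tgt_tokens) - 1, gate_vec)
```

In a real pipeline the patched representations would be fed to the diffusion model in place of the original encoding; only the bookkeeping of the intervention is shown here.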
Related papers
- LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs [40.11215282864732]
We introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. We evaluate this method on 10 different Vision-Language Models (VLMs). We show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans.
arXiv Detail & Related papers (2026-01-31T02:33:07Z)
- Zero-Shot Chinese Character Recognition with Hierarchical Multi-Granularity Image-Text Aligning [52.92837273570818]
Chinese characters exhibit unique structures and compositional rules, allowing for the use of fine-grained semantic information in representation. We propose a Hierarchical Multi-Granularity Image-Text Aligning (Hi-GITA) framework based on a contrastive paradigm. Our proposed Hi-GITA outperforms existing zero-shot CCR methods.
arXiv Detail & Related papers (2025-05-30T17:39:14Z)
- Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models [64.52046218688295]
Text-to-image (T2I) diffusion models rely on encoded prompts to guide the image generation process. We conduct the first in-depth analysis of the role padding tokens play in T2I models. Our findings reveal three distinct scenarios: padding tokens may affect the model's output during text encoding, during the diffusion process, or be effectively ignored.
arXiv Detail & Related papers (2025-01-12T08:36:38Z)
- Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting [8.572133295533643]
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes.
Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image.
arXiv Detail & Related papers (2024-03-27T01:28:36Z)
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using two dedicated strategies.
DEADiff attains the best visual stylization results and an optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z)
- Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines [33.49257838597258]
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process.
We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations.
arXiv Detail & Related papers (2024-03-09T09:11:49Z)
- Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
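The Contrastive Guidance idea (replacing the classifier-free guidance direction with the difference between two minimally differing prompts) can be sketched with toy vectors; `denoiser` below is a hypothetical stand-in for the diffusion model's noise predictor, not the paper's implementation.

```python
import numpy as np

def denoiser(prompt):
    """Toy stand-in for the diffusion model's noise predictor
    eps(x_t, prompt): a fixed pseudo-random vector per prompt."""
    seed = sum(prompt.encode())  # deterministic toy seed
    return np.random.default_rng(seed).normal(size=4)

w = 7.5  # guidance scale

# Standard classifier-free guidance (CFG): push away from the
# unconditional prediction toward the conditional one.
eps_uncond = denoiser("")
eps_cond = denoiser("a photo of a smiling person")
cfg = eps_uncond + w * (eps_cond - eps_uncond)

# Contrastive Guidance: the intended factor (here, the smile) is
# characterized by two prompts differing in minimal tokens, and their
# difference replaces the CFG direction.
eps_pos = denoiser("a photo of a smiling person")
eps_neg = denoiser("a photo of a frowning person")
contrastive = eps_uncond + w * (eps_pos - eps_neg)
```

Because the two prompts share everything but the factor-defining tokens, their difference isolates that factor more cleanly than the generic conditional-minus-unconditional direction.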
arXiv Detail & Related papers (2024-02-21T03:01:17Z)
- Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models [9.514940899499752]
Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images.
However, when it comes to text-based image generation, they often fail to capture the intended meaning presented in the text.
We propose Predicated Diffusion, a unified framework to express users' intentions.
arXiv Detail & Related papers (2023-10-03T15:45:50Z)
- The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model.
We find surprising visual connections between concepts, that transcend their textual semantics.
We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [102.88033622546251]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images: (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive vision-language pre-training approaches such as CLIP, moving from patch and token embeddings to finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - What Are You Token About? Dense Retrieval as Distributions Over the
Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connection between them and sparse retrieval.
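The vocabulary-projection idea can be illustrated with toy matrices; `W_vocab` below is a hypothetical stand-in for the encoder's output embedding matrix, and the token names are placeholders rather than a real vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100
vocab = [f"tok{i}" for i in range(vocab_size)]

# Toy output embedding matrix mapping hidden states to vocabulary logits.
W_vocab = rng.normal(size=(vocab_size, d))

# Toy dense query/passage vector produced by a dual encoder.
vec = rng.normal(size=d)

# Project the dense vector into vocabulary space and read off the
# highest-scoring tokens as a human-readable interpretation.
logits = W_vocab @ vec
top_k = np.argsort(logits)[::-1][:5]
interpretation = [vocab[i] for i in top_k]
```

The top-scoring tokens serve as the "distribution over the vocabulary" view of the dense vector, which is also what links it to sparse retrieval.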
arXiv Detail & Related papers (2022-12-20T16:03:25Z) - Compound Tokens: Channel Fusion for Vision-Language Representation
Learning [36.19486792701684]
We present an effective method for fusing visual-and-language representations for question answering tasks.
By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods.
We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting.
arXiv Detail & Related papers (2022-12-02T21:09:52Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Text-Guided Neural Image Inpainting [20.551488941041256]
The inpainting task requires filling a corrupted image with content coherent with its context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
arXiv Detail & Related papers (2020-04-07T09:04:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.