Vision-Language Binding in In-Context Image Generation
Abstract Overview
This paper studies how in-context image generation models, with a focus on FLUX.2, route information between text tokens, reference-image tokens, and output image tokens during reference-conditioned editing. With three causal interventions (T2I Lens, Attention Knockout, and I2I-to-I2I Patching), the authors test whether text tokens absorb visual information from the reference image and whether that information affects generation. Across 2,875 editing tasks spanning object addition, object removal, human customization, color transfer, and style transfer, they find a consistent division of labor between pathways. General, language-like properties such as color, style, and scene setting are written into text tokens, whereas pixel-exact properties such as specific human identity bypass text tokens and travel directly through image-to-image attention. The study further localizes this cross-modal binding primarily to padding tokens rather than to the instruction content tokens.
Novelty
The main novelty is the identification and causal analysis of an implicit vision-language binding mechanism inside a unified-attention multimodal diffusion transformer for image editing. The paper is also distinctive in localizing the binding to text padding tokens and in introducing intervention-based probes that separate text-mediated transfer from direct image-to-image routing.
Results
The experiments show that text tokens in FLUX.2 reliably encode and causally transfer reference color and style, but not exact human identity. Each probe supports this from a different angle. T2I Lens reveals high observation rates for color and style content in text-token activations. Attention Knockout shows that disrupting reference-to-text attention strongly breaks color and style transfer, whereas reference-to-image knockout is more damaging for identity. I2I-to-I2I Patching transfers color and style at high rates but has essentially no effect on identity. Additional padding-only tests indicate that the bound reference information resides mainly in padding tokens, while content tokens contribute little to this transfer.
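The general idea behind an attention knockout can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head in NumPy, a hypothetical token layout (4 text, 3 reference, 5 output tokens), random activations, and the assumption that "reference-to-text" means text-token queries reading from reference-token keys. The function name `attention_with_knockout` is illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf scores become exactly zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_knockout(q, k, v, query_idx=None, key_idx=None):
    """Single-head attention over one joint token sequence.

    If both index arrays are given, queries in `query_idx` are blocked
    from attending to keys in `key_idx` by setting those pre-softmax
    scores to -inf. Every other attention edge is left untouched.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if query_idx is not None and key_idx is not None:
        scores[np.ix_(query_idx, key_idx)] = -np.inf
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical layout of the joint sequence: [text | reference | output].
text_idx = np.arange(0, 4)
ref_idx = np.arange(4, 7)
out_idx = np.arange(7, 12)

rng = np.random.default_rng(0)
n_tokens, d_head = 12, 8
q = rng.normal(size=(n_tokens, d_head))
k = rng.normal(size=(n_tokens, d_head))
v = rng.normal(size=(n_tokens, d_head))

# Baseline run, then a run where text queries cannot read reference keys.
out_base, w_base = attention_with_knockout(q, k, v)
out_ko, w_ko = attention_with_knockout(q, k, v, text_idx, ref_idx)
```

In this sketch only the text-token rows change, because the mask removes one group of edges and renormalizes the remaining weights for those queries. Knocking out the reference-to-image pathway would, under the same assumption, pass `out_idx` as the query group instead of `text_idx`.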
Key Points
- Three causal probes—T2I Lens, Attention Knockout, and I2I-to-I2I Patching—are used to analyze how reference information moves through FLUX.2 during in-context image editing.
- Reference properties that are more abstract and describable in language, such as color, style, and scene context, are mediated by text tokens, whereas pixel-exact identity information travels through direct image-to-image attention.
- The cross-modal binding is localized mainly to text padding tokens rather than instruction content tokens, suggesting a structured and unexpected role for padding in multimodal generation.
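A padding-only patching test of the kind mentioned above can be sketched as position-restricted activation patching. This is a generic illustration rather than the paper's procedure: the summary does not say which layers, timesteps, or activations are patched, so the sequence length, the padding mask, the random activations, and the helper name `patch_positions` are all assumptions.

```python
import numpy as np

def patch_positions(target_acts, source_acts, positions):
    """Return a copy of `target_acts` whose rows at `positions`
    are overwritten with the matching rows of `source_acts`."""
    if target_acts.shape != source_acts.shape:
        raise ValueError("source and target activations must have the same shape")
    patched = target_acts.copy()
    patched[positions] = source_acts[positions]
    return patched

# Hypothetical text sequence: 3 instruction content tokens, then 5 padding tokens.
is_padding = np.array([False, False, False, True, True, True, True, True])
pad_positions = np.flatnonzero(is_padding)
content_positions = np.flatnonzero(~is_padding)

rng = np.random.default_rng(1)
hidden_dim = 16
# Stand-ins for text-token activations from two editing runs that
# differ only in their reference image (source run vs. target run).
source = rng.normal(size=(is_padding.size, hidden_dim))
target = rng.normal(size=(is_padding.size, hidden_dim))

# Padding-only patch: content-token activations stay those of the target run.
patched = patch_positions(target, source, pad_positions)
```

Swapping `pad_positions` for `content_positions` gives the complementary condition, which is how such a comparison would separate the contribution of padding tokens from that of instruction content tokens.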