Discovering Divergent Representations between Text-to-Image Models
- URL: http://arxiv.org/abs/2509.08940v1
- Date: Wed, 10 Sep 2025 19:07:55 GMT
- Title: Discovering Divergent Representations between Text-to-Image Models
- Authors: Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic, Bryan Russell
- Abstract summary: We investigate when and how visual representations learned by two different generative models diverge. We introduce CompCon, an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other. We use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets.
- Score: 87.40710629963264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
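The quantity CompCon searches over can be illustrated with a minimal sketch (assuming a generic VLM-style `scorer` callable and hypothetical attribute strings; this is not the authors' implementation): score a candidate visual attribute in images from both models and rank candidates by the prevalence gap.

```python
from typing import Callable, List, Tuple

def prevalence_gap(
    attribute: str,
    images_a: List[str],
    images_b: List[str],
    scorer: Callable[[str, str], float],  # (image, attribute) -> presence score in [0, 1]
    threshold: float = 0.5,
) -> float:
    """Fraction of model A's images showing the attribute minus model B's fraction."""
    prev_a = sum(scorer(img, attribute) > threshold for img in images_a) / len(images_a)
    prev_b = sum(scorer(img, attribute) > threshold for img in images_b) / len(images_b)
    return prev_a - prev_b

def rank_attributes(
    candidates: List[str],
    images_a: List[str],
    images_b: List[str],
    scorer: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Sort candidate attributes by how much more often model A depicts them than model B."""
    gaps = {attr: prevalence_gap(attr, images_a, images_b, scorer) for attr in candidates}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
```

The full method additionally evolves the candidate attribute set and links each surviving attribute back to the prompt concepts that trigger it; the sketch covers only the scoring step.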
Related papers
- PromptSplit: Revealing Prompt-Level Disagreement in Generative Models [18.957478338649114]
Prompt-guided generative AI models have rapidly expanded across vision and language domains. We propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences.
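As a rough illustration of how a kernel statistic can quantify such prompt-dependent disagreement, the sketch below computes a squared MMD with an RBF kernel between image embeddings of the two models' outputs for one prompt; PromptSplit's actual estimator may differ.

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Pairwise RBF kernel between rows of x and rows of y."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Biased squared MMD between two sets of image embeddings (one set per model)."""
    k_aa = rbf_kernel(emb_a, emb_a).mean()
    k_bb = rbf_kernel(emb_b, emb_b).mean()
    k_ab = rbf_kernel(emb_a, emb_b).mean()
    return k_aa + k_bb - 2 * k_ab
```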
arXiv Detail & Related papers (2026-02-03T20:53:10Z)
- Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? [10.10422216411379]
We study whether vision-language models such as CLIP capture this distinction. We curated a dataset named RoLA (Real or Lookalike) of real and lookalike exemplars. From these exemplars we derive a direction in embedding space; applying this direction to image and text embeddings improves discrimination in cross-modal retrieval.
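A minimal sketch of one way such a direction could be estimated, as the mean difference between real and lookalike exemplar embeddings (an assumption for illustration, not necessarily the paper's procedure):

```python
import numpy as np

def real_lookalike_direction(real_embs: np.ndarray, lookalike_embs: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the lookalike cluster toward the real cluster."""
    direction = real_embs.mean(axis=0) - lookalike_embs.mean(axis=0)
    return direction / np.linalg.norm(direction)

def realness_score(embedding: np.ndarray, direction: np.ndarray) -> float:
    """Higher values suggest a real object; lower values suggest a look-alike."""
    return float(embedding @ direction)
```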
arXiv Detail & Related papers (2025-11-24T15:09:32Z)
- Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders [41.08205377881149]
This work explores text-to-image retrieval for queries that specify or describe a semantic category. We transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model.
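A minimal sketch of the generate-then-retrieve idea, assuming Hugging Face diffusers and a CLIP vision encoder (the checkpoints named below are placeholders, not the paper's choices):

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoints; any text-to-image pipeline and vision encoder would do.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    """L2-normalised vision-encoder embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query: str, gallery: list, top_k: int = 5):
    """Turn the text query into a visual query, then rank the gallery by image-to-image similarity."""
    visual_query = pipe(query).images[0]
    sims = (embed_images([visual_query]) @ embed_images(gallery).T).squeeze(0)
    return sims.topk(min(top_k, len(gallery))).indices.tolist()
```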
arXiv Detail & Related papers (2025-08-29T18:24:38Z)
- Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions [28.53636082915161]
Dual encoder architectures like CLIP map two types of inputs into a shared embedding space and predict similarities between them. Common first-order feature-attribution methods explain the importance of individual features and can thus provide only limited insight into dual encoders. We derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs.
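A toy sketch of the idea with tiny linear encoders: attribute the similarity score onto pairs of input features via the cross-derivatives d^2 s / dx_i dy_j, scaled gradient-times-input style (an illustration of second-order attribution, not the paper's exact formulation):

```python
import torch

torch.manual_seed(0)
f = torch.nn.Linear(4, 3)   # toy "image" encoder
g = torch.nn.Linear(5, 3)   # toy "caption" encoder

x = torch.randn(4, requires_grad=True)   # one image-side input
y = torch.randn(5, requires_grad=True)   # one text-side input

score = f(x) @ g(y)          # dual-encoder similarity score (a scalar)

# First-order: gradient of the score w.r.t. the image-side input.
grad_x = torch.autograd.grad(score, x, create_graph=True)[0]

# Second-order: cross-derivatives d^2 score / dx_i dy_j, one row per x_i.
cross = torch.stack([
    torch.autograd.grad(grad_x[i], y, retain_graph=True)[0]
    for i in range(x.numel())
])

# Gradient-times-input style attribution for every (x_i, y_j) feature pair.
interactions = x.unsqueeze(1) * cross * y.unsqueeze(0)
print(interactions.shape)    # torch.Size([4, 5])
```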
arXiv Detail & Related papers (2024-08-26T09:55:34Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
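Ranking similarity for a paraphrase pair can be measured, for example, as Kendall's tau between the gallery rankings the two queries induce; the sketch below is one such measurement (an assumed metric, not necessarily the paper's):

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_similarity(text_embs_a: np.ndarray, text_embs_b: np.ndarray,
                       image_embs: np.ndarray) -> float:
    """Mean Kendall's tau between the gallery rankings induced by paraphrased query pairs.

    text_embs_a[i] and text_embs_b[i] are L2-normalised embeddings of two paraphrases
    of the same query; image_embs holds L2-normalised gallery embeddings.
    """
    taus = []
    for qa, qb in zip(text_embs_a, text_embs_b):
        tau, _ = kendalltau(image_embs @ qa, image_embs @ qb)
        taus.append(tau)
    return float(np.mean(taus))
```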
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations [64.43387739794531]
Current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles.
We introduce DEADiff to address this issue using two strategies.
DEADiff attains the best visual stylization results and the optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image.
arXiv Detail & Related papers (2024-03-11T17:35:23Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching. We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
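One common form of such an objective is a penalty on pairwise overlap between the heads' attention distributions, as in this sketch (an illustration of the general idea, not MVAM's exact loss):

```python
import torch

def head_diversity_penalty(attn: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between attention heads.

    attn: (num_heads, seq_len) attention weights of each head over the input.
    Adding this term to the training loss pushes heads toward distinct aspects.
    """
    normed = torch.nn.functional.normalize(attn, dim=-1)
    sims = normed @ normed.T                 # (num_heads, num_heads)
    num_heads = attn.shape[0]
    off_diag = sims - torch.eye(num_heads)   # zero out self-similarity
    return off_diag.sum() / (num_heads * (num_heads - 1))
```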
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks [124.90137528319273]
In this paper, we present IMProv, a generative model that is able to in-context learn visual tasks from multimodal prompts.
We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions.
During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output.
arXiv Detail & Related papers (2023-12-04T09:48:29Z)
- Localizing and Editing Knowledge in Text-to-Image Generative Models [62.02776252311559]
Knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet.
We introduce Diff-QuickFix, a fast, data-free model editing method that can effectively edit concepts in text-to-image models.
arXiv Detail & Related papers (2023-10-20T17:31:12Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
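The preferred strategy, classifier-free guidance, combines a conditional and an unconditional noise prediction at every denoising step; a minimal sketch of that combination (tensor names are placeholders):

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             guidance_scale: float = 3.0) -> torch.Tensor:
    """Steer the denoiser toward the text condition by extrapolating past the unconditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```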
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
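The paper steers language-model decoding with the image-text matching score; the simpler sketch below only illustrates the scoring component by reranking candidate captions with CLIP (candidate generation is left out, and the checkpoint is a placeholder):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_caption(image: Image.Image, candidates: list) -> str:
    """Pick the candidate caption the matching model scores highest for the image."""
    inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image   # (1, num_candidates)
    return candidates[int(logits.argmax())]
```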
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.