CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
- URL: http://arxiv.org/abs/2511.08075v1
- Date: Wed, 12 Nov 2025 01:38:21 GMT
- Title: CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
- Authors: Cameron Braunstein, Mariya Toneva, Eddy Ilg
- Abstract summary: We investigate whether the internal representations used by text-to-image generation models contain semantic information that is meaningful to humans. We find that the semantic information recovered by probing can be attributed to the text encoding performed by CLIP rather than to the reverse diffusion process. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation.
- Score: 15.715635327960882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes of objects, and evaluate these predictions against human annotations. Surprisingly, we find that the success of these probes can be attributed to the text encoding performed by CLIP rather than to the reverse diffusion process. We demonstrate that groups of specific semantic attributes are decoded with markedly different accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another over the course of the reverse diffusion process, further indicating that the strongest semantic representation of object attributes lies in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.
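The probing setup described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example rather than the authors' code: it encodes object names with the CLIP text encoder used by Stable Diffusion v1.x and fits a ridge-regression probe from the pooled embeddings to placeholder human attribute ratings. The object list, the rating matrix, the mean-pooling choice, and the cross-validation setup are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' released code): probe the CLIP text encoder
# used by Stable Diffusion with a simple linear regression layer that predicts
# human-annotated semantic attributes of objects. The object names and the
# attribute ratings below are hypothetical placeholders.
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModel
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

model_id = "openai/clip-vit-large-patch14"  # text encoder of Stable Diffusion v1.x
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id).eval()

object_names = ["apple", "hammer", "dolphin", "violin", "cactus", "anchor"]  # hypothetical
human_ratings = np.random.rand(len(object_names), 5)  # placeholder for real annotations

# Encode each object name with the CLIP text encoder and mean-pool over tokens.
with torch.no_grad():
    tokens = tokenizer(object_names, padding=True, return_tensors="pt")
    hidden = text_encoder(**tokens).last_hidden_state  # (n_objects, seq_len, dim)
    features = hidden.mean(dim=1).numpy()               # (n_objects, dim)

# The "simple regression layer": ridge regression from embeddings to attributes,
# scored with cross-validated predictions against the (placeholder) human ratings.
probe = RidgeCV(alphas=np.logspace(-3, 3, 13))
preds = cross_val_predict(probe, features, human_ratings, cv=3)
r = np.corrcoef(preds.ravel(), human_ratings.ravel())[0, 1]
print(f"Probe/annotation correlation: {r:.3f}")
```

In the paper's actual setup, the same style of linear probe would also be applied to the internal representations of the reverse diffusion process at different steps, so that decoding accuracy from CLIP text embeddings and from diffusion-stage representations can be compared against the human annotations.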
Related papers
- Disentangled representations via score-based variational autoencoders [21.955536401578616]
We present the Score-based Autoencoder for Multiscale Inference (SAMI). SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. It can extract useful representations from pre-trained diffusion models with minimal additional training.
arXiv Detail & Related papers (2025-12-18T23:42:10Z)
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models. We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences. We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
- Learned representation-guided diffusion models for large-image generation [58.192263311786824]
We introduce a novel approach that trains diffusion models conditioned on embeddings from self-supervised learning (SSL).
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
Augmenting real data by generating variations of real images improves downstream accuracy for patch-level and larger, image-scale classification tasks.
arXiv Detail & Related papers (2023-12-12T14:45:45Z)
- Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter [47.29967666846132]
Generative text-to-image diffusion models are highly efficient open-vocabulary semantic segmenters.
We introduce DiffSegmenter, a novel training-free approach built on the insight that, to generate realistic objects that are semantically faithful to the input text, diffusion models must implicitly capture both object shapes and their semantics.
Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2023-09-06T06:31:08Z)
- Unsupervised Semantic Correspondence Using Stable Diffusion [27.355330079806027]
We show that the semantic knowledge within diffusion models can be leveraged to find semantic correspondences.
We optimize the prompt embeddings of these models for maximum attention on the regions of interest.
We significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets.
arXiv Detail & Related papers (2023-05-24T21:34:34Z)
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
- ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
arXiv Detail & Related papers (2022-12-14T15:22:13Z)
- Diffusion Visual Counterfactual Explanations [51.077318228247925]
Visual Counterfactual Explanations (VCEs) are an important tool for understanding the decisions of an image classifier.
Current approaches for generating VCEs are restricted to adversarially robust models and often contain non-realistic artefacts.
In this paper, we overcome this by generating Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers.
arXiv Detail & Related papers (2022-10-21T09:35:47Z)
- Contextual Semantic Interpretability [16.18912769522768]
We look into semantic bottlenecks that capture context.
We use a two-layer semantic bottleneck that gathers attributes into interpretable, sparse groups.
Our model yields predictions as accurate as a non-interpretable baseline when applied to a real-world test set of Flickr images.
arXiv Detail & Related papers (2020-09-18T09:47:05Z)