Related papers: From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

URL: http://arxiv.org/abs/2505.20229v1
Date: Mon, 26 May 2025 17:08:02 GMT
Title: From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
Authors: Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek,
Abstract summary: We introduce a framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions.<n>Our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts.
Score: 14.30327576545802
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.

Related papers

ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching [9.610261779024219]
ABE-CLIP is a training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models.<n>We employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text.<n>By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity.
arXiv Detail & Related papers (2025-12-19T02:36:51Z)
A Conditional Probability Framework for Compositional Zero-shot Learning [86.86063926727489]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions.<n>Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning.<n>We adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies.
arXiv Detail & Related papers (2025-07-23T10:20:52Z)
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for AG-ReID.<n>It adopts prompt-tuning strategies to leverage attribute-based text knowledge.<n>Our framework can fully leverage attribute-based text knowledge to improve the AG-ReID.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning [38.507994878183474]
We introduce Semantically contextualized VIsual Patches (SVIP) for Zero-shot learning (ZSL)<n>We propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space.<n>SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.
arXiv Detail & Related papers (2025-03-13T10:59:51Z)
KNN Transformer with Pyramid Prompts for Few-Shot Learning [52.735070934075736]
Few-Shot Learning aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features.
arXiv Detail & Related papers (2024-10-14T07:39:30Z)
Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning [46.25534556546322]
We propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Our method performs favorably against previous state-of-the-arts considering few-shot classification settings.
arXiv Detail & Related papers (2024-06-17T06:28:58Z)
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues [55.97779732051921]
A new learning strategy is proposed to explicitly incorporate au cues into classifier training. We show that our strategy can improve layer-wise interpretability without degrading classification performance.
arXiv Detail & Related papers (2024-02-01T02:13:49Z)
Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object) We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning [53.32576252950481]
Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks.
arXiv Detail & Related papers (2023-05-19T07:39:17Z)
Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition [23.748227536306295]
We propose to understand human attributes using video frames that can make full use of temporal information. We formulate the video-based PAR as a vision-language fusion problem and adopt pre-trained big models CLIP to extract the feature embeddings of given video frames.
arXiv Detail & Related papers (2023-04-20T05:18:28Z)
DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins. Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.