Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
- URL: http://arxiv.org/abs/2402.10376v1
- Date: Fri, 16 Feb 2024 00:04:36 GMT
- Title: Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
- Authors: Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P. Calmon,
Himabindu Lakkaraju
- Abstract summary: We show that CLIP's latent space is highly structured, and that CLIP representations can be decomposed into their underlying semantic components.
We propose a novel method, Sparse Linear Concept Embeddings (SpLiCE), for transforming CLIP representations into sparse linear combinations of human-interpretable concepts.
- Score: 23.993903128858832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP embeddings have demonstrated remarkable performance across a wide range
of computer vision tasks. However, these high-dimensional, dense vector
representations are not easily interpretable, restricting their usefulness in
downstream applications that require transparency. In this work, we empirically
show that CLIP's latent space is highly structured, and consequently that CLIP
representations can be decomposed into their underlying semantic components. We
leverage this understanding to propose a novel method, Sparse Linear Concept
Embeddings (SpLiCE), for transforming CLIP representations into sparse linear
combinations of human-interpretable concepts. Distinct from previous work,
SpLiCE does not require concept labels and can be applied post hoc. Through
extensive experimentation with multiple real-world datasets, we validate that
the representations output by SpLiCE can explain and even replace traditional
dense CLIP representations, maintaining equivalent downstream performance while
significantly improving their interpretability. We also demonstrate several use
cases of SpLiCE representations including detecting spurious correlations,
model editing, and quantifying semantic shifts in datasets.
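To make the decomposition concrete, here is a minimal sketch of the core idea, not the authors' released implementation: approximate a CLIP image embedding as a sparse non-negative linear combination of concept text embeddings by solving a positive lasso. The random `concept_embs` and `image_emb` below are placeholders for real CLIP embeddings of a concept vocabulary and an image.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Placeholder data: in practice, `concept_embs` would hold unit-normalized CLIP
# text embeddings of a single-word concept vocabulary, and `image_emb` would be
# a unit-normalized CLIP image embedding.
rng = np.random.default_rng(0)
d, num_concepts = 512, 1000
concept_embs = rng.normal(size=(num_concepts, d))
concept_embs /= np.linalg.norm(concept_embs, axis=1, keepdims=True)
image_emb = rng.normal(size=d)
image_emb /= np.linalg.norm(image_emb)

# Sparse non-negative decomposition: min_w ||C^T w - z||^2 + alpha*||w||_1
# subject to w >= 0. The nonzero entries of w name the concepts that the
# embedding is made of.
solver = Lasso(alpha=0.01, positive=True, max_iter=10000)
solver.fit(concept_embs.T, image_emb)  # design matrix is (d, num_concepts)
weights = solver.coef_

top = np.argsort(weights)[::-1][:5]
print("top concept indices:", top, "weights:", np.round(weights[top], 4))
```

With a real concept vocabulary, the indices of the nonzero weights map back to human-readable words, which is what makes the sparse representation interpretable.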
Related papers
- Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design.
We demonstrate that the two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
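The residual-connection claim above can be illustrated with a toy final transformer block; this is a sketch of the idea, not CLIP's actual architecture, and the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

# Toy final transformer block (not CLIP's real code): illustrates dropping the
# residual path in the last layer, which the abstract identifies as the main
# source of noise for dense prediction.
class ToyFinalBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, drop_residual: bool) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        # Standard block keeps the residual (x + attn_out); the variant for
        # dense inference keeps only the attention output.
        return attn_out if drop_residual else x + attn_out

tokens = torch.randn(1, 197, 64)  # [CLS] + 14x14 patch tokens
block = ToyFinalBlock()
dense_feats = block(tokens, drop_residual=True)[:, 1:]  # patch features only
print(dense_feats.shape)  # torch.Size([1, 196, 64])
```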
- Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet [4.597864989500202]
We propose a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings.
ConVis uses lexical information from WordNet to compute task-agnostic saliency maps for any concept, not limited to those the end model was trained on.
arXiv Detail & Related papers (2024-05-23T13:41:17Z)
- Refining Skewed Perceptions in Vision-Language Models through Visual Representations [0.033483662989441935]
Large vision-language models (VLMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks.
Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment.
This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications.
arXiv Detail & Related papers (2024-05-22T22:03:11Z)
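A linear probe of the kind this abstract describes is easy to sketch. The random features below are stand-ins for frozen CLIP image embeddings of a labeled downstream dataset; with real embeddings, accuracy above chance reflects information already present in CLIP's space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features: in practice these would be frozen CLIP image embeddings
# (e.g., extracted with open_clip) paired with downstream task labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 512)), rng.integers(0, 2, 1000)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)

# A linear probe is a single linear classifier on frozen embeddings, so any
# gain over chance must come from features CLIP already encodes.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```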
- Demystifying Embedding Spaces using Large Language Models [26.91321899603332]
This paper addresses the challenge of making embeddings more interpretable and broadly useful.
By employing Large Language Models (LLMs) to directly interact with embeddings, we transform abstract vectors into understandable narratives.
We demonstrate our approach on a variety of tasks, including enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems.
arXiv Detail & Related papers (2023-10-06T05:27:28Z)
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
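The descriptor-replacement idea is simple enough to sketch. The prompt template below is an assumption, and the text encoder and similarity scoring are omitted.

```python
import random
import string

random.seed(0)

def random_word(length: int = 8) -> str:
    # Random character descriptor: semantics-free filler standing in for an
    # LLM-generated class description, per the abstract.
    return "".join(random.choices(string.ascii_lowercase, k=length))

classes = ["goldfish", "ambulance", "espresso"]
num_descriptors = 4
# Hypothetical prompt template; the paper's exact wording may differ.
prompts = {
    c: [f"a photo of a {c}, which has {random_word()}." for _ in range(num_descriptors)]
    for c in classes
}
# Each class would then be scored by the average similarity between the image
# embedding and the text embeddings of its prompts (encoder not shown).
print(prompts["goldfish"])
```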
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models [110.00434385712786]
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
arXiv Detail & Related papers (2023-02-28T08:11:56Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)