Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
- URL: http://arxiv.org/abs/2402.10376v1
- Date: Fri, 16 Feb 2024 00:04:36 GMT
- Title: Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
- Authors: Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P. Calmon,
Himabindu Lakkaraju
- Abstract summary: We show that CLIP's latent space is highly structured, and that CLIP representations can be decomposed into their underlying semantic components.
We propose a novel method, Sparse Linear Concept Embeddings (SpLiCE), for transforming CLIP representations into sparse linear combinations of human-interpretable concepts.
- Score: 23.993903128858832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP embeddings have demonstrated remarkable performance across a wide range
of computer vision tasks. However, these high-dimensional, dense vector
representations are not easily interpretable, restricting their usefulness in
downstream applications that require transparency. In this work, we empirically
show that CLIP's latent space is highly structured, and consequently that CLIP
representations can be decomposed into their underlying semantic components. We
leverage this understanding to propose a novel method, Sparse Linear Concept
Embeddings (SpLiCE), for transforming CLIP representations into sparse linear
combinations of human-interpretable concepts. Distinct from previous work,
SpLiCE does not require concept labels and can be applied post hoc. Through
extensive experimentation with multiple real-world datasets, we validate that
the representations output by SpLiCE can explain and even replace traditional
dense CLIP representations, maintaining equivalent downstream performance while
significantly improving their interpretability. We also demonstrate several use
cases of SpLiCE representations including detecting spurious correlations,
model editing, and quantifying semantic shifts in datasets.
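To make the decomposition concrete, here is a minimal sketch of the core idea, not the authors' released implementation: approximate a CLIP image embedding as a sparse non-negative linear combination of concept text embeddings by solving a positive lasso. The random `concept_embs` and `image_emb` below are placeholders for real CLIP embeddings of a concept vocabulary and an image.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Placeholder data: in practice, `concept_embs` would hold unit-normalized CLIP
# text embeddings of a single-word concept vocabulary, and `image_emb` would be
# a unit-normalized CLIP image embedding.
rng = np.random.default_rng(0)
d, num_concepts = 512, 1000
concept_embs = rng.normal(size=(num_concepts, d))
concept_embs /= np.linalg.norm(concept_embs, axis=1, keepdims=True)
image_emb = rng.normal(size=d)
image_emb /= np.linalg.norm(image_emb)

# Sparse non-negative decomposition: min_w ||C^T w - z||^2 + alpha*||w||_1
# subject to w >= 0. The nonzero entries of w name the concepts that the
# embedding is made of.
solver = Lasso(alpha=0.01, positive=True, max_iter=10000)
solver.fit(concept_embs.T, image_emb)  # design matrix is (d, num_concepts)
weights = solver.coef_

top = np.argsort(weights)[::-1][:5]
print("top concept indices:", top, "weights:", np.round(weights[top], 4))
```

With a real concept vocabulary, the indices of the nonzero weights map back to human-readable words, which is what makes the sparse representation interpretable.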
Related papers
- Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design.
We demonstrate that the two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
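The residual-connection claim above can be illustrated with a toy final transformer block; this is a sketch of the idea, not CLIP's actual architecture, and the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

# Toy final transformer block (not CLIP's real code): illustrates dropping the
# residual path in the last layer, which the abstract identifies as the main
# source of noise for dense prediction.
class ToyFinalBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, drop_residual: bool) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        # Standard block keeps the residual (x + attn_out); the variant for
        # dense inference keeps only the attention output.
        return attn_out if drop_residual else x + attn_out

tokens = torch.randn(1, 197, 64)  # [CLS] + 14x14 patch tokens
block = ToyFinalBlock()
dense_feats = block(tokens, drop_residual=True)[:, 1:]  # patch features only
print(dense_feats.shape)  # torch.Size([1, 196, 64])
```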
- Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet [4.597864989500202]
We propose a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings.
ConVis uses lexical information from WordNet to compute task-agnostic saliency maps for any concept, not limited to those the end model was trained on.
arXiv Detail & Related papers (2024-05-23T13:41:17Z)
- Refining Skewed Perceptions in Vision-Language Models through Visual Representations [0.033483662989441935]
Large vision-language models (VLMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks.
Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment.
This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications.
arXiv Detail & Related papers (2024-05-22T22:03:11Z)
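A linear probe of the kind this abstract describes is easy to sketch. The random features below are stand-ins for frozen CLIP image embeddings of a labeled downstream dataset; with real embeddings, accuracy above chance reflects information already present in CLIP's space.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features: in practice these would be frozen CLIP image embeddings
# (e.g., extracted with open_clip) paired with downstream task labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 512)), rng.integers(0, 2, 1000)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 2, 200)

# A linear probe is a single linear classifier on frozen embeddings, so any
# gain over chance must come from features CLIP already encodes.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```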
- Demystifying Embedding Spaces using Large Language Models [26.91321899603332]
This paper addresses the challenge of making embeddings more interpretable and broadly useful.
By employing Large Language Models (LLMs) to directly interact with embeddings, we transform abstract vectors into understandable narratives.
We demonstrate our approach on a variety of tasks, including enhancing concept activation vectors (CAVs), communicating novel embedded entities, and decoding user preferences in recommender systems.
arXiv Detail & Related papers (2023-10-06T05:27:28Z)
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
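The descriptor-replacement idea is simple enough to sketch. The prompt template below is an assumption, and the text encoder and similarity scoring are omitted.

```python
import random
import string

random.seed(0)

def random_word(length: int = 8) -> str:
    # Random character descriptor: semantics-free filler standing in for an
    # LLM-generated class description, per the abstract.
    return "".join(random.choices(string.ascii_lowercase, k=length))

classes = ["goldfish", "ambulance", "espresso"]
num_descriptors = 4
# Hypothetical prompt template; the paper's exact wording may differ.
prompts = {
    c: [f"a photo of a {c}, which has {random_word()}." for _ in range(num_descriptors)]
    for c in classes
}
# Each class would then be scored by the average similarity between the image
# embedding and the text embeddings of its prompts (encoder not shown).
print(prompts["goldfish"])
```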
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models [110.00434385712786]
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs).
We first present a framework for understanding compositional structures from a geometric perspective.
We then explain what these structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice.
arXiv Detail & Related papers (2023-02-28T08:11:56Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)