Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
- URL: http://arxiv.org/abs/2601.20075v1
- Date: Tue, 27 Jan 2026 21:39:00 GMT
- Title: Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
- Authors: Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning. CLIP's dense and opaque latent representations pose significant interpretability challenges. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant.
- Score: 11.31435293510471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.
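The abstract does not spell out the sparsity mechanism, so as a rough illustration only, here is a minimal PyTorch sketch assuming a TopK-style activation sparsification applied to both embeddings inside the standard symmetric InfoNCE objective; the function names and the `k`/temperature values are hypothetical, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest-magnitude activations per row, zero the rest."""
    _, idx = x.abs().topk(k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    return x * mask  # gradients flow only through the retained entries

def sparse_clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     k: int = 64, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE on sparsified, L2-normalized embeddings
    (a sketch of 'sparsity during training', not the authors' code)."""
    img = F.normalize(topk_sparsify(img_emb, k), dim=-1)
    txt = F.normalize(topk_sparsify(txt_emb, k), dim=-1)
    logits = img @ txt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under a scheme like this, each image or text ends up represented by a small set of active dimensions, which is what makes per-dimension concept alignment tractable downstream.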
Related papers
- Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding [53.18433310890516]
Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings. We propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning.
arXiv Detail & Related papers (2025-11-11T17:23:02Z)
- Scaling Language-Centric Omnimodal Representation Learning [26.999264997449586]
Multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining. We propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb.
arXiv Detail & Related papers (2025-10-13T17:53:52Z)
- VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions [16.90061119174727]
We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Secondly, CLIP-IN incorporates long captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP.
arXiv Detail & Related papers (2025-08-04T11:57:10Z)
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [72.02635550088546]
This work explores how large language models (LLMs) can enhance CLIP's capability, especially for processing longer and more complex image captions. We introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP (FSC-CLIP) integrates a local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path. We evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP.
arXiv Detail & Related papers (2024-05-23T07:18:08Z)
- Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) [22.364723506539974]
We show that the semantic structure of CLIP's latent space can be leveraged to provide interpretability.
We propose a novel method, Sparse Linear Concept Embeddings, for transforming CLIP representations into sparse linear combinations of human-interpretable concepts (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-02-16T00:04:36Z)
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
To refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
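The SpLiCE entry above describes decomposing CLIP representations into sparse combinations of human-interpretable concepts. Here is a minimal sketch of that general idea, assuming a non-negative sparse-coding formulation solved with projected ISTA; the concept dictionary, step size, and L1 penalty below are illustrative assumptions, not the SpLiCE reference implementation:

```python
import torch

def concept_decompose(z: torch.Tensor, concept_dict: torch.Tensor,
                      lam: float = 0.1, lr: float = 0.01,
                      n_steps: int = 500) -> torch.Tensor:
    """Approximate embedding z (shape (d,)) as a sparse non-negative
    combination of concept embeddings (rows of concept_dict, shape (C, d))
    via projected ISTA. A sketch of the idea, not the authors' code."""
    z = z / z.norm()
    D = concept_dict / concept_dict.norm(dim=-1, keepdim=True)  # (C, d)
    w = torch.zeros(D.size(0))
    for _ in range(n_steps):
        resid = w @ D - z   # reconstruction residual, shape (d,)
        grad = D @ resid    # gradient of 0.5 * ||w @ D - z||^2
        # Gradient step, L1 shrinkage, and projection onto w >= 0 in one go;
        # convergence requires lr below 1 / ||D @ D.T||_2.
        w = torch.clamp(w - lr * (grad + lam), min=0.0)
    return w  # mostly-zero concept weights
```

The nonzero entries of `w` name the concepts an embedding is built from; the Sparse CLIP abstract argues that this kind of readout can instead be learned directly during training rather than recovered post hoc.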