Steering CLIP's vision transformer with sparse autoencoders
- URL: http://arxiv.org/abs/2504.08729v1
- Date: Fri, 11 Apr 2025 17:56:09 GMT
- Title: Steering CLIP's vision transformer with sparse autoencoders
- Authors: Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards
- Abstract summary: We train sparse autoencoders (SAEs) on CLIP's vision transformer to uncover key differences between vision and language processing. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model.
- Score: 20.63298721008492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis of the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
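As a concrete illustration of the steering intervention described in the abstract, the sketch below encodes one layer's activations with an SAE, clamps a single feature to a chosen value, and decodes back while preserving the SAE's reconstruction error. This is a minimal PyTorch sketch under assumed interfaces, not the authors' released code.

```python
# Minimal sketch of the SAE-based steering intervention described above.
# The SAE layout (ReLU encoder/decoder over one ViT layer's residual stream)
# and all names here are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))          # sparse feature activations

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

@torch.no_grad()
def steer(sae: SparseAutoencoder, acts: torch.Tensor,
          feature_idx: int, value: float) -> torch.Tensor:
    """Clamp one SAE feature to `value` and return the edited activations.

    `acts` is a (batch, tokens, d_model) slice of CLIP-ViT residual-stream
    activations; the result is meant to replace it in the forward pass.
    """
    f = sae.encode(acts)
    error = acts - sae.decode(f)     # keep whatever the SAE fails to reconstruct
    f[..., feature_idx] = value      # the actual intervention
    return sae.decode(f) + error
```

Suppression, as used for the CelebA, Waterbirds, and typographic-attack experiments, corresponds to value = 0; depending on the CLIP implementation, a forward hook on the chosen transformer block (for example, model.visual.transformer.resblocks[l] in the original OpenAI code) can substitute the steered activations during inference.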
Related papers
- TopK Language Models [23.574227495324568]
TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts.
arXiv Detail & Related papers (2025-06-26T16:56:43Z)
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability. (A rough sketch of the bagging variant follows this entry.)
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
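The entry above names naive bagging and boosting of SAEs. One plausible reading of the bagging variant, sketched below under assumed interfaces rather than the paper's code, is to train several SAEs on the same activations with different seeds and average their reconstructions.

```python
# Hypothetical sketch of "naive bagging" over SAEs: independently trained
# autoencoders whose reconstructions are averaged. Interfaces are assumptions.
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(torch.relu(self.enc(x)))    # reconstruct from a sparse code

@torch.no_grad()
def bagged_reconstruction(ensemble: list[SAE], acts: torch.Tensor) -> torch.Tensor:
    """Average the reconstructions of independently trained SAEs."""
    return torch.stack([sae(acts) for sae in ensemble]).mean(dim=0)

# e.g. ensemble = [SAE(768, 16_384) for _ in range(5)]  # each trained with its own seed
```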
- Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
Sparse Autoencoders (SAEs) have been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
- Sparse Autoencoder Features for Classifications and Transferability [11.2185030332009]
We analyze Sparse Autoencoders (SAEs) for interpretable feature extraction from Large Language Models (LLMs). Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. (A minimal sketch of the binarization step follows this entry.)
arXiv Detail & Related papers (2025-02-17T02:30:45Z)
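Point (3) of the entry above, binarizing continuous SAE activations, is straightforward to illustrate. The sketch below thresholds feature activations to 0/1 before fitting a linear probe; the threshold and probe are assumptions, not the paper's setup.

```python
# Hedged sketch: binarize continuous SAE activations before fitting a classifier.
# Any feature that fires (activation > threshold) becomes 1, everything else 0.
import torch

def binarize(sae_acts: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Map continuous SAE activations (batch, d_features) to 0/1 firing indicators."""
    return (sae_acts > threshold).float()

# A linear probe can then be trained on the binary codes, for example:
#   from sklearn.linear_model import LogisticRegression
#   probe = LogisticRegression(max_iter=1000).fit(binarize(acts).numpy(), labels)
```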
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology [15.83613460419667]
Histopathology Whole-Slide Images (WSIs) provide an important tool to assess cancer prognosis in computational pathology (CPATH). Existing survival analysis approaches have made exciting progress, but they are generally limited to adopting highly-expressive network architectures. This paper proposes a new Vision-Language-based Survival Analysis (VLSA) paradigm to overcome these performance bottlenecks.
arXiv Detail & Related papers (2024-09-14T08:47:45Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers [105.89564687747134]
We propose a self-regularized AutoAugment method to learn views for self-supervised vision transformers.
First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously.
We also present a curated augmentation policy search space for self-supervised learning.
arXiv Detail & Related papers (2022-10-16T06:20:44Z)
- Lite Vision Transformer with Enhanced Self-Attention [39.32480787105232]
We propose Lite Vision Transformer (LVT), a novel light-weight vision transformer network with two enhanced self-attention mechanisms.
For the low-level features, we introduce Convolutional Self-Attention (CSA).
For the high-level features, we propose Recursive Atrous Self-Attention (RASA).
arXiv Detail & Related papers (2021-12-20T19:11:53Z)
- Self-Supervised Models are Continual Learners [79.70541692930108]
We show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for Continual Learning.
We devise a framework for Continual self-supervised visual representation Learning that significantly improves the quality of the learned representations.
arXiv Detail & Related papers (2021-12-08T10:39:13Z)