Steering CLIP's vision transformer with sparse autoencoders
- URL: http://arxiv.org/abs/2504.08729v1
- Date: Fri, 11 Apr 2025 17:56:09 GMT
- Title: Steering CLIP's vision transformer with sparse autoencoders
- Authors: Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, Blake Aaron Richards
- Abstract summary: We train sparse autoencoders (SAEs) on CLIP's vision transformer to uncover key differences between vision and language processing. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model.
- Score: 20.63298721008492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
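The "targeted suppression of SAE features" described in the abstract can be illustrated with a minimal sketch. The class names, dimensions, and ReLU-gated architecture below are illustrative assumptions, not the authors' code; the key idea is to zero out one feature in the SAE's latent space and propagate only that change back into the model's activations, preserving the SAE's reconstruction error.

```python
import torch

# A ReLU-gated SAE (illustrative; the paper's exact architecture may differ).
class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_features)
        self.dec = torch.nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

def suppress_feature(sae, activations, feature_idx):
    """Reconstruct activations with one SAE feature zeroed out."""
    recon, feats = sae(activations)
    steered = feats.clone()
    steered[..., feature_idx] = 0.0  # targeted suppression of one feature
    # Add back the SAE's reconstruction error so only the intended
    # feature change propagates into the model's residual stream.
    error = activations - recon
    return sae.dec(steered) + error

sae = SparseAutoencoder(d_model=768, d_features=4096)
acts = torch.randn(2, 50, 768)  # [batch, tokens, d_model], e.g. ViT activations
out = suppress_feature(sae, acts, feature_idx=123)
print(out.shape)  # torch.Size([2, 50, 768])
```

Amplifying a feature (multiplying instead of zeroing) follows the same pattern and is the basis of the steerability metrics the abstract mentions.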
Related papers
- SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning [88.9014727048442]
SSL4RL is a novel framework that leverages self-supervised learning tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks.
arXiv Detail & Related papers (2025-10-18T09:22:40Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Analysis of Variational Sparse Autoencoders [1.675385127117872]
We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with sampling from learned Gaussian posteriors. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
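The gating change this summary describes can be sketched roughly as follows. This is an assumption about the mechanism based on the summary alone, not the paper's implementation; the names and shapes are illustrative. Instead of `f = ReLU(Wx)`, the encoder predicts a mean and log-variance per feature, and activations are sampled via the reparameterization trick.

```python
import torch

# Illustrative vSAE: sample feature activations from a learned Gaussian
# posterior instead of applying a deterministic ReLU gate.
class VariationalSAE(torch.nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.mu = torch.nn.Linear(d_model, d_features)
        self.log_var = torch.nn.Linear(d_model, d_features)
        self.dec = torch.nn.Linear(d_features, d_model)

    def forward(self, x):
        mu, log_var = self.mu(x), self.log_var(x)
        eps = torch.randn_like(mu)
        feats = mu + eps * torch.exp(0.5 * log_var)  # reparameterization trick
        return self.dec(feats), mu, log_var

vsae = VariationalSAE(d_model=768, d_features=4096)
recon, mu, log_var = vsae(torch.randn(4, 768))
print(recon.shape)  # torch.Size([4, 768])
```

A KL-divergence term against a prior would typically be added to the training loss, as in a standard VAE.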
arXiv Detail & Related papers (2025-09-26T23:09:56Z)
- Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation [110.03631978640298]
We present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics. We show that these issues can be effectively addressed by introducing self-supervised objectives during training.
arXiv Detail & Related papers (2025-09-18T17:47:40Z)
- Probing the Representational Power of Sparse Autoencoders in Vision Models [16.82204018033778]
Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). Despite their popularity with language models, SAEs remain understudied in the visual domain. We provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks.
arXiv Detail & Related papers (2025-08-15T07:29:42Z)
- TopK Language Models [23.574227495324568]
TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts.
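For context, the TopK gating that gives these models their name can be sketched generically: keep only the k largest pre-activations per example and zero out the rest. This is a standard illustration of TopK activation, not code from the paper.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    vals, idx = torch.topk(pre_acts, k, dim=-1)
    return torch.zeros_like(pre_acts).scatter(-1, idx, vals)

a = torch.tensor([[0.1, 2.0, -1.0, 3.0]])
print(topk_activation(a, 2))  # tensor([[0., 2., 0., 3.]])
```

Unlike an L1 penalty, this enforces an exact sparsity level k by construction.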
arXiv Detail & Related papers (2025-06-26T16:56:43Z)
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
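The naive-bagging variant this summary mentions can be sketched as averaging the reconstructions of several independently initialized SAEs. The architecture and names below are illustrative assumptions (random initialization stands in for independently trained models), not the paper's code.

```python
import torch

# A minimal ReLU SAE, as a stand-in for independently trained ensemble members.
class SAE(torch.nn.Module):
    def __init__(self, d_model=768, d_features=2048):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_features)
        self.dec = torch.nn.Linear(d_features, d_model)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

# Naive bagging: average the reconstructions of each ensemble member.
ensemble = [SAE() for _ in range(3)]
x = torch.randn(8, 768)  # activations to reconstruct
recon = torch.stack([sae(x) for sae in ensemble]).mean(dim=0)
print(recon.shape)  # torch.Size([8, 768])
```

Boosting would instead train each subsequent SAE on the residual error left by the previous ones.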
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
- Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
Sparse Autoencoders (SAEs) have been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations.
arXiv Detail & Related papers (2025-04-03T17:58:35Z)
- Sparse Autoencoder Features for Classifications and Transferability [11.2185030332009]
We analyze Sparse Autoencoders (SAEs) for interpretable feature extraction from Large Language Models (LLMs). Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations.
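The binarization step in point (3) amounts to thresholding continuous SAE activations into fired/not-fired indicators before using them as classifier inputs. The threshold and shapes below are illustrative assumptions based on the summary alone.

```python
import torch

# Continuous SAE feature activations (illustrative values); ReLU-gated SAEs
# produce non-negative activations, so "> 0" marks a feature as having fired.
feats = torch.tensor([[0.0, 0.7, 0.2],
                      [1.3, 0.0, 0.4]])
binary = (feats > 0).float()  # 1.0 if the feature fired, else 0.0
print(binary)  # tensor([[0., 1., 1.], [1., 0., 1.]])
```

The binary features can then be fed to a downstream classifier in place of the raw activations.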
arXiv Detail & Related papers (2025-02-17T02:30:45Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology [15.83613460419667]
Histopathology Whole-Slide Images (WSIs) provide an important tool to assess cancer prognosis in computational pathology (CPATH). Existing survival analysis approaches have made exciting progress, but they are generally limited to adopting highly-expressive network architectures. This paper proposes a new Vision-Language-based Survival Analysis (VLSA) paradigm to overcome performance bottlenecks.
arXiv Detail & Related papers (2024-09-14T08:47:45Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers [105.89564687747134]
We propose a self-regularized AutoAugment method to learn views for self-supervised vision transformers.
First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously.
We also present a curated augmentation policy search space for self-supervised learning.
arXiv Detail & Related papers (2022-10-16T06:20:44Z)
- Lite Vision Transformer with Enhanced Self-Attention [39.32480787105232]
We propose Lite Vision Transformer (LVT), a novel light-weight vision transformer network with two enhanced self-attention mechanisms.
For the low-level features, we introduce Convolutional Self-Attention (CSA).
For the high-level features, we propose Recursive Atrous Self-Attention (RASA).
arXiv Detail & Related papers (2021-12-20T19:11:53Z)
- Self-Supervised Models are Continual Learners [79.70541692930108]
We show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for Continual Learning.
We devise a framework for Continual self-supervised visual representation Learning that significantly improves the quality of the learned representations.
arXiv Detail & Related papers (2021-12-08T10:39:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the listed information and is not responsible for any consequences arising from its use.