Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
- URL: http://arxiv.org/abs/2502.06755v1
- Date: Mon, 10 Feb 2025 18:32:41 GMT
- Title: Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
- Authors: Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
- Abstract summary: We present a unified framework using sparse autoencoders (SAEs) to discover human-interpretable visual features.
We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training.
- Score: 27.806966289284528
- License:
- Abstract: To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical usage of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos and models on our project website: https://osu-nlp-group.github.io/SAE-V.
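To make the discover-then-intervene workflow concrete, here is a minimal sketch (not the authors' released SAE-V code) of a sparse autoencoder trained on frozen vision-model activations, plus a clamp-style intervention on a single latent feature; names such as SparseAutoencoder, expansion_factor, and l1_coeff are illustrative assumptions.

```python
# Minimal sketch, assuming frozen ViT activations of width d_model are available
# via a forward hook; illustrative only, not the released SAE-V implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int = 16):
        super().__init__()
        d_sae = d_model * expansion_factor      # overcomplete dictionary of features
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model) activations -> non-negative, sparse feature codes
        return F.relu(self.enc(x))

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.dec(z), z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3) -> torch.Tensor:
    # reconstruction error plus an L1 penalty that encourages sparse codes
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()

@torch.no_grad()
def intervene(sae: SparseAutoencoder, acts: torch.Tensor,
              feature_idx: int, value: float = 0.0) -> torch.Tensor:
    # Clamp one interpretable latent (suppress with 0.0, or amplify with a
    # larger value) and decode back to activation space for patching.
    z = sae.encode(acts)
    z[:, feature_idx] = value
    return sae.dec(z)
```

In practice, acts would come from a hook on a frozen backbone (for example, a ViT block output), and the edited activations returned by intervene would be patched back into the model to test a feature's causal effect on downstream predictions, mirroring the controlled interventions described above.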
Related papers
- Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks.
Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z)
- Fill in the blanks: Rethinking Interpretability in vision [0.0]
We rethink vision-model explainability from a novel perspective, probing the general input structure that a model has learned during training.
Experiments on standard vision datasets and pre-trained models reveal consistent patterns, and the approach could be integrated as an additional model-agnostic explainability tool.
arXiv Detail & Related papers (2024-11-15T15:31:06Z)
- Interpreting and Controlling Vision Foundation Models via Text Explanations [45.30541722925515]
We present a framework for interpreting a vision transformer's latent tokens with natural language.
Our approach enables understanding of model visual reasoning procedure without needing additional model training or data collection.
arXiv Detail & Related papers (2023-10-16T17:12:06Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning for robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- InDL: A New Dataset and Benchmark for In-Diagram Logic Interpretation based on Visual Illusion [1.7980584146314789]
This paper introduces a novel approach to evaluating deep learning models' capacity for in-diagram logic interpretation.
We establish a unique dataset, InDL, designed to rigorously test and benchmark these models.
We utilize six classic geometric optical illusions to create a comparative framework between human and machine visual perception.
arXiv Detail & Related papers (2023-05-28T13:01:32Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Model-Based Inverse Reinforcement Learning from Visual Demonstrations [20.23223474119314]
We present a gradient-based inverse reinforcement learning framework that learns cost functions when given only visual human demonstrations.
The learned cost functions are then used to reproduce the demonstrated behavior via visual model predictive control.
We evaluate our framework on hardware for two basic object manipulation tasks.
arXiv Detail & Related papers (2020-10-18T17:07:53Z)