Sparse Autoencoders Do Not Find Canonical Units of Analysis
- URL: http://arxiv.org/abs/2502.04878v1
- Date: Fri, 07 Feb 2025 12:33:08 GMT
- Title: Sparse Autoencoders Do Not Find Canonical Units of Analysis
- Authors: Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
- Abstract summary: A common goal of mechanistic interpretability is to decompose the activations of neural networks into features.
Sparse autoencoders (SAEs) are a popular method for finding these features.
We use two novel techniques: SAE stitching, which shows that SAEs are incomplete, and meta-SAEs, which show that their latents are not atomic.
- Score: 6.0188420022822955
- Abstract: A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel latents indicates incompleteness of smaller SAEs. Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
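To make the two techniques concrete, here is a minimal, illustrative sketch in PyTorch. It assumes a plain ReLU SAE; the function names (stitch_novel_latents, train_meta_sae), the hyperparameters, and the choice to simply add the novel latents' decoder contributions on top of the smaller SAE's reconstruction are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE: activations -> sparse latents -> reconstruction."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_latent) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_latent, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z


def stitch_novel_latents(small_sae, large_sae, x, novel_idx):
    """SAE stitching sketch: take the small SAE's reconstruction of x and add
    the decoder contributions of selected 'novel' latents from the large SAE."""
    recon_small, _ = small_sae(x)
    z_large = large_sae.encode(x)
    # Keep only the chosen novel latents before decoding their contribution.
    mask = torch.zeros_like(z_large)
    mask[:, novel_idx] = 1.0
    novel_contribution = (z_large * mask) @ large_sae.W_dec
    return recon_small + novel_contribution


def train_meta_sae(base_sae, d_meta=512, steps=1000, lr=1e-3, l1_coeff=1e-3):
    """Meta-SAE sketch: treat each decoder direction of a trained SAE as a
    data point and train a second SAE on those directions. Hyperparameters
    here are placeholders."""
    decoder_dirs = base_sae.W_dec.detach()          # (d_latent, d_model)
    meta_sae = SparseAutoencoder(decoder_dirs.shape[1], d_meta)
    opt = torch.optim.Adam(meta_sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, z = meta_sae(decoder_dirs)
        loss = (recon - decoder_dirs).pow(2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return meta_sae
```

Under these assumptions, meta_sae.encode(base_sae.W_dec.detach()) would return a sparse code over meta-latents for each latent of the base SAE, which is the kind of decomposition the paper reports (e.g., an "Einstein" latent splitting into "scientist", "Germany", and "famous person").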
Related papers
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection.
For steering, we find that prompting outperforms all existing methods, followed by finetuning.
For concept detection, representation-based methods such as difference-in-means perform best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z)
- Can sparse autoencoders make sense of latent representations? [0.0]
We find that latent representations can encode observable and directly connected hidden variables in superposition.
Applying SAEs to single-cell multi-omics data, we show that they can uncover key biological processes.
arXiv Detail & Related papers (2024-10-15T10:16:01Z)
- A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders [0.0]
Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs).
In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents?
Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability?
arXiv Detail & Related papers (2024-09-22T16:11:02Z)
- States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly [72.24742240125369]
In this paper, we uncover the intrinsic ability of LLMs to perform extended sequences of calculations without relying on chain-of-thought, step-by-step solutions.
Remarkably, the most advanced models can directly output the results of adding up to 15 two-digit addends.
arXiv Detail & Related papers (2024-07-16T06:27:22Z)
- Interpreting Attention Layer Outputs with Sparse Autoencoders [3.201633659481912]
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work, we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work.
arXiv Detail & Related papers (2024-06-25T17:43:13Z)
- Boosting Segment Anything Model Towards Open-Vocabulary Learning [69.24734826209367]
Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model.
Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics.
We present Sambor, which seamlessly integrates SAM with an open-vocabulary object detector in an end-to-end framework.
arXiv Detail & Related papers (2023-12-06T17:19:00Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- An Overview of Advances in Signal Processing Techniques for Classical and Quantum Wideband Synthetic Apertures [67.73886953504947]
Synthetic aperture (SA) systems generate a larger aperture with greater angular resolution than is inherently possible from the physical dimensions of a single sensor alone.
We provide a brief overview of emerging signal processing trends in such spatially and spectrally wideband SA systems.
In particular, we cover the theoretical framework and practical underpinnings of wideband SA radar, channel sounding, sonar, radiometry, and optical applications.
arXiv Detail & Related papers (2022-05-11T16:19:04Z)