Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
- URL: http://arxiv.org/abs/2503.24277v2
- Date: Fri, 08 Aug 2025 11:47:22 GMT
- Title: Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
- Authors: Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier
- Abstract summary: We introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach.
- Score: 3.9230690073443166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$, which represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link: the $\ell_2$-norm of the sparse feature vector can be approximated by the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
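To make the idea concrete, here is a minimal numpy sketch of a top-AFA-style activation, assuming a plain ReLU encoder and using a simple relative tolerance `rtol` in place of the paper's closed-form error bound (both the function and the tolerance are illustrative, not the authors' exact formulation): the encoder keeps the largest activations until the $\ell_2$-norm of the sparse code approximately matches that of the dense input, so $k$ is determined per input rather than fixed in advance.

```python
import numpy as np

def top_afa_sketch(h, W_enc, b_enc, rtol=0.05):
    """Illustrative top-AFA-style activation (see assumptions above).

    Keeps the largest ReLU pre-activations until the l2-norm of the kept
    sparse code approximately matches the l2-norm of the dense input h,
    so the sparsity level k is chosen per input rather than fixed.
    """
    z = np.maximum(W_enc @ h + b_enc, 0.0)   # ReLU pre-activations
    order = np.argsort(-z)                   # feature indices, largest first
    target = np.linalg.norm(h)               # norm the sparse code should reach
    kept = np.zeros_like(z)
    sq_sum = 0.0
    for k, i in enumerate(order, start=1):
        if z[i] <= 0.0:
            break                            # no positive activations left
        kept[i] = z[i]
        sq_sum += z[i] ** 2
        if np.sqrt(sq_sum) >= (1.0 - rtol) * target:
            break                            # norm target (approximately) reached
    return kept, k

# Toy usage: 16-dim embedding, 64 dictionary features.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
W_enc, b_enc = rng.normal(size=(64, 16)) / 4.0, np.zeros(64)
code, k = top_afa_sketch(h, W_enc, b_enc)
print(k, np.linalg.norm(code), np.linalg.norm(h))
```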
Related papers
- Sparse Semantic Dimension as a Generalization Certificate for LLMs [53.681678236115836]
We introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. We validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes.
arXiv Detail & Related papers (2026-02-11T21:45:18Z) - Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in\mathbb{R}^{r\times d}$. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
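For concreteness, a toy instance of this target class (the link function $g$ below is an arbitrary stand-in; the paper's guarantees hold for generic non-degenerate links):

```python
import numpy as np

# Toy Gaussian multi-index target f(x) = g(Ux), with U in R^{r x d}.
# The product-of-tanh link g is an arbitrary stand-in for illustration.
rng = np.random.default_rng(1)
d, r, n = 32, 3, 1000
U = np.linalg.qr(rng.normal(size=(d, r)))[0].T  # r x d, orthonormal rows
g = lambda u: np.tanh(u).prod(axis=-1)          # link acting on the r-dim projection
X = rng.normal(size=(n, d))                     # Gaussian inputs
y = g(X @ U.T)                                  # labels depend on x only through Ux
```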
arXiv Detail & Related papers (2025-11-19T04:46:47Z) - DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code [5.38764489657443]
The use of Large Language Models (LLMs) to generate multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, which predominantly use zero-shot methods such as Fast-DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two. We propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained RoBERTa and CodeBERTa models, using specialized datasets of source code and other natural language.
arXiv Detail & Related papers (2025-10-21T00:17:00Z) - Binary Autoencoder for Mechanistic Interpretability of Large Language Models [8.725176890854065]
We propose a novel Binary Autoencoder variant that enforces minimal entropy on minibatches of hidden activations. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function. We empirically evaluate the approach and leverage it to characterize the inference dynamics of Large Language Models.
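A minimal sketch of the 1-bit discretization and minibatch entropy estimate described above (how the non-differentiable step function is handled during training, and the full objective, are in the paper and omitted here):

```python
import numpy as np

def minibatch_bit_entropy(H):
    """Binarize hidden activations with a step function and estimate the
    per-unit entropy of the resulting bits over a minibatch.

    H: (batch, hidden) real-valued activations. Returns mean bits per unit.
    """
    B = (H > 0.0).astype(np.float64)          # step function -> 1-bit codes
    p = B.mean(axis=0).clip(1e-8, 1 - 1e-8)   # firing probability per unit
    ent = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return ent.mean()

print(minibatch_bit_entropy(np.random.default_rng(2).normal(size=(128, 512))))
```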
arXiv Detail & Related papers (2025-09-25T10:48:48Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
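The abstract only names the technique, so the following is a hypothetical control-loop reading of bias adaptation: nudge each encoder bias against the gap between its feature's observed firing rate and a target rate. The update rule and its hyperparameters are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adapt_bias_sketch(b_enc, firing_rate, target_rate, lr=0.01):
    """Hypothetical bias-adaptation step: lower the encoder bias of units
    that fire more often than the target rate and raise it for units that
    fire less, nudging the SAE toward the desired activation sparsity.
    """
    return b_enc - lr * (firing_rate - target_rate)

rng = np.random.default_rng(3)
b = np.zeros(256)
rates = rng.uniform(0.0, 0.2, size=256)  # observed per-feature firing rates
b = adapt_bias_sketch(b, rates, target_rate=0.02)
```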
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models [48.40096116617163]
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets.
arXiv Detail & Related papers (2025-05-21T15:17:59Z) - SplInterp: Improving our Understanding and Training of Sparse Autoencoders [10.800240155402417]
Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability. However, there have been recent doubts about their true utility. We develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs.
arXiv Detail & Related papers (2025-05-17T04:51:26Z) - Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [1.0582505915332336]
Sparse autoencoders (SAEs) are assumed to decompose polysemantic activations into interpretable linear directions. We find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by the SAE reconstruction loss and is more severe the narrower the SAE.
arXiv Detail & Related papers (2025-05-16T23:30:17Z) - SAND: One-Shot Feature Selection with Additive Noise Distortion [3.5976830118932583]
We introduce a novel, non-intrusive feature selection layer that automatically identifies and selects the $k$ most informative features during neural network training. Our method is uniquely simple, requiring no alterations to the loss function, network architecture, or post-selection retraining. Our work demonstrates that simplicity and performance are not mutually exclusive, offering a powerful yet straightforward tool for feature selection in machine learning.
arXiv Detail & Related papers (2025-05-06T18:59:35Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Are Sparse Autoencoders Useful? A Case Study in Sparse Probing [6.836374436707495]
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations.
One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines.
We test this by applying SAEs to the real-world task of LLM activation probing in four regimes.
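As a rough illustration of the probing setup (with synthetic stand-ins for the LLM activations and the SAE encoder, which in the paper are real model activations and a pre-trained SAE):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of sparse probing: fit a probe on SAE feature activations instead
# of raw activations. All data here is synthetic for illustration.
rng = np.random.default_rng(6)
acts = rng.normal(size=(500, 64))                   # stand-in LLM activations
labels = (acts[:, 0] + acts[:, 3] > 0).astype(int)  # toy binary concept
W_enc = rng.normal(size=(256, 64)) / 8.0
encode = lambda a: np.maximum(a @ W_enc.T, 0.0)     # stand-in SAE encoder

probe = LogisticRegression(max_iter=1000).fit(encode(acts), labels)
print(probe.score(encode(acts), labels))            # train accuracy of the probe
```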
arXiv Detail & Related papers (2025-02-23T18:54:15Z) - AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods, such as difference-in-means, perform best.
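Difference-in-means, the representation-based baseline named above, is simple enough to state in a few lines (toy data; the benchmark's actual protocol is more involved):

```python
import numpy as np

def diff_in_means_detector(acts_pos, acts_neg):
    """Difference-in-means concept detection: the concept direction is
    mean(activations with concept) minus mean(activations without), and
    new activations are scored by projection onto that direction.
    """
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    direction /= np.linalg.norm(direction)
    return lambda a: a @ direction               # higher score = concept present

rng = np.random.default_rng(4)
pos = rng.normal(0.5, 1.0, size=(200, 64))       # toy "concept present" activations
neg = rng.normal(0.0, 1.0, size=(200, 64))
score = diff_in_means_detector(pos, neg)
print(score(pos).mean() > score(neg).mean())     # True on this toy data
```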
arXiv Detail & Related papers (2025-01-28T18:51:24Z) - Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks [1.4565166775409717]
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units. We introduce a family of evaluations based on SHIFT, a downstream task from Marks et al. We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM. We also introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts.
arXiv Detail & Related papers (2024-11-28T03:58:48Z) - Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning [4.051777802443125]
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations. We introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function. We find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts.
arXiv Detail & Related papers (2024-11-15T18:03:52Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
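One crude way to operationalize the compression view (an illustrative coding scheme, not necessarily the paper's MDL formulation): charge each active feature the bits needed to index it plus a fixed budget for its coefficient, so codes are compared by total bits rather than $\ell_0$ alone.

```python
import numpy as np

def description_length_bits(code, n_features, value_bits=16):
    """Illustrative MDL-style cost of one SAE code: each active feature
    costs log2(n_features) bits to say WHICH feature fired plus value_bits
    for its coefficient. Reconstruction-error bits are omitted here.
    """
    l0 = (code != 0).sum()
    return l0 * (np.log2(n_features) + value_bits)

sparse_code = np.zeros(4096)
sparse_code[[3, 17, 99]] = [1.2, -0.4, 0.9]
print(description_length_bits(sparse_code, n_features=4096))  # 3 * (12 + 16) = 84 bits
```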
arXiv Detail & Related papers (2024-10-15T01:38:03Z) - $X^2$-DFD: A framework for eXplainable and eXtendable Deepfake Detection [52.14468236527728]
We propose a novel framework called $X^2$-DFD, consisting of three core modules.
The first module, Model Feature Assessment (MFA), measures the detection capabilities of forgery features intrinsic to MLLMs, and gives a descending ranking of these features.
The second module, Strong Feature Strengthening (SFS), enhances the detection and explanation capabilities by fine-tuning the MLLM on a dataset constructed based on the top-ranked features.
The third module, Weak Feature Supplementing (WFS), improves the fine-tuned MLLM's capabilities on lower-ranked features by integrating external dedicated...
arXiv Detail & Related papers (2024-10-08T15:28:33Z) - A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders [0.0]
We show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their child features.
arXiv Detail & Related papers (2024-09-22T16:11:02Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations.
In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage.
In training SAEs on LMs of up to 7B parameters, Gated SAEs solve shrinkage and require half as many firing features to achieve comparable reconstruction fidelity.
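A sketch of the gated encoder idea, assuming the weight-sharing scheme described in the paper (a per-feature rescaling of shared encoder weights); the exact parameterization may differ:

```python
import numpy as np

def gated_sae_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated SAE encoder sketch: a binary gate chooses WHICH features fire
    while a separate magnitude path chooses HOW STRONGLY. Decoupling the
    two is what removes the shrinkage bias an L1 penalty on a single
    path induces.
    """
    x_cent = x - b_dec
    pi_gate = W_gate @ x_cent + b_gate       # gating pre-activations
    W_mag = np.exp(r_mag)[:, None] * W_gate  # shared weights, per-feature rescale
    pi_mag = W_mag @ x_cent + b_mag          # magnitude pre-activations
    return (pi_gate > 0).astype(x.dtype) * np.maximum(pi_mag, 0.0)

# Toy usage: 16-dim input, 64 features.
rng = np.random.default_rng(7)
d, m = 16, 64
f = gated_sae_encode(rng.normal(size=d), rng.normal(size=(m, d)),
                     np.zeros(m), np.zeros(m), np.zeros(m), np.zeros(d))
```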
arXiv Detail & Related papers (2024-04-24T17:47:22Z) - Symmetric Equilibrium Learning of VAEs [56.56929742714685]
We view variational autoencoders (VAEs) as decoder-encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa.
We propose a Nash equilibrium learning approach, which is symmetric with respect to the encoder and decoder and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling.
arXiv Detail & Related papers (2023-07-19T10:27:34Z) - Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator.
This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z) - Stealing the Decoding Algorithms of Language Models [56.369946232765656]
A key component of generating text from modern language models (LMs) is the selection and tuning of decoding algorithms.
In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms.
Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo.
arXiv Detail & Related papers (2023-03-08T17:15:58Z) - Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations [4.666056064419346]
The efficient coding hypothesis proposes that the response properties of sensory systems are adapted to the statistics of their inputs.
While elegant, information theoretic properties are notoriously difficult to measure in practical settings or to employ as objective functions in optimization.
Here we outline the assumptions that allow manifold capacity to be optimized directly, yielding Maximum Manifold Capacity Representations (MMCR).
arXiv Detail & Related papers (2023-03-06T17:26:30Z) - Provably Efficient Neural Offline Reinforcement Learning via Perturbed Rewards [33.88533898709351]
VIPeR amalgamates the randomized value function idea with the pessimism principle.
It implicitly obtains pessimism by simply perturbing the offline data multiple times.
It is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
arXiv Detail & Related papers (2023-02-24T17:52:12Z) - Neural networks behave as hash encoders: An empirical study [79.38436088982283]
The input space of a neural network with ReLU-like activations is partitioned into multiple linear regions.
We demonstrate that this partition exhibits the following encoding properties across a variety of deep learning models.
Simple algorithms, such as $K$-Means, $K$-NN, and logistic regression, can achieve fairly good performance on both training and test data.
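A sketch of the "hash code" in question: the binary pattern of active ReLU units, which is constant within each linear region and is what the simple algorithms above operate on.

```python
import numpy as np

def relu_activation_pattern(x, weights, biases):
    """The binary "hash code" of an input under a ReLU network: the pattern
    of which units are active at each layer. Inputs in the same linear
    region of the partition share a code.
    """
    bits, h = [], x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        bits.append(pre > 0)                 # which units are active here
        h = np.maximum(pre, 0.0)
    return np.concatenate(bits).astype(np.uint8)

rng = np.random.default_rng(5)
Ws = [rng.normal(size=(32, 8)), rng.normal(size=(16, 32))]
bs = [np.zeros(32), np.zeros(16)]
print(relu_activation_pattern(rng.normal(size=8), Ws, bs))
```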
arXiv Detail & Related papers (2021-01-14T07:50:40Z) - Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - Learning Autoencoders with Relational Regularization [89.53065887608088]
A new framework is proposed for learning autoencoders of data distributions.
We minimize the discrepancy between the model and target distributions with a "relational regularization".
We implement the framework with two scalable algorithms, making it applicable for both probabilistic and deterministic autoencoders.
arXiv Detail & Related papers (2020-02-07T17:27:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.