Related papers: SplInterp: Improving our Understanding and Training of Sparse Autoencoders

URL: http://arxiv.org/abs/2505.11836v1
Date: Sat, 17 May 2025 04:51:26 GMT
Title: SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Authors: Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero
Abstract summary: Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability. There have been recent doubts about the true utility of SAEs. We develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs.
Score: 10.800240155402417
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise "k-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "k-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
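
Illustration (not from the paper or its repository): the claim that a TopK SAE is piecewise affine can be checked numerically with a minimal sketch. The code below assumes a simplified TopK SAE that keeps the k largest encoder pre-activations and has no additional ReLU or pre-encoder bias. The function names `topk_sae` and `local_affine_map`, the dimensions, and the random weights are all hypothetical. Once the active set of latents is fixed, the reconstruction reduces to a single affine map of the input.

```python
import numpy as np


def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Simplified TopK SAE forward pass (assumed form, not the paper's code)."""
    pre = W_enc @ x + b_enc                   # encoder pre-activations
    active = np.sort(np.argsort(pre)[-k:])    # indices of the k largest entries
    z = np.zeros_like(pre)
    z[active] = pre[active]                   # keep only the active latents
    return W_dec @ z + b_dec, active


def local_affine_map(active, W_enc, b_enc, W_dec, b_dec):
    """Affine map (A, c) that the SAE applies while `active` is the active set."""
    A = W_dec[:, active] @ W_enc[active, :]
    c = W_dec[:, active] @ b_enc[active] + b_dec
    return A, c


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
W_enc = rng.normal(size=(m, d))
b_enc = rng.normal(size=m)
W_dec = rng.normal(size=(d, m))
b_dec = rng.normal(size=d)

x = rng.normal(size=d)
x_hat, active = topk_sae(x, W_enc, b_enc, W_dec, b_dec, k)
A, c = local_affine_map(active, W_enc, b_enc, W_dec, b_dec)

# On the region where this active set is selected, the SAE equals x -> A x + c.
matches_affine = np.allclose(x_hat, A @ x + c)
print(matches_affine)
```

Each distinct active set gives a different pair (A, c), so the input space is partitioned into regions with one affine map per region. The paper characterises those regions using power diagrams; this sketch does not compute them.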

Related papers

Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models [48.40096116617163]
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets.
arXiv Detail & Related papers (2025-05-21T15:17:59Z)
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality [3.9230690073443166]
We show that the magnitude of sparse feature vectors can be approximated using their corresponding dense vector with a closed-form error bound. We introduce Approximate Feature Activation (AFA), which approximates the magnitude of the ground-truth sparse feature vector. We demonstrate that top-AFA SAEs achieve reconstruction loss comparable to that of state-of-the-art top-k SAEs.
arXiv Detail & Related papers (2025-03-31T16:22:11Z)
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations [21.142967037533175]
We propose Jacobian SAEs (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.
arXiv Detail & Related papers (2025-02-25T12:21:45Z)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z)
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders [0.0]
A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. We prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases.
arXiv Detail & Related papers (2024-11-20T08:21:53Z)
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z)
AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models [94.82766517752418]
We propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. Our results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs.
arXiv Detail & Related papers (2024-10-14T03:35:11Z)
Interpreting Attention Layer Outputs with Sparse Autoencoders [3.201633659481912]
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. In this work we train SAEs on attention layer outputs and show that also here SAEs find a sparse, interpretable decomposition. We show that Sparse Autoencoders are a useful tool that enable researchers to explain model behavior in greater detail than prior work.
arXiv Detail & Related papers (2024-06-25T17:43:13Z)
RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation. RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set. Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z)
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes. We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.