On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond
- URL: http://arxiv.org/abs/2506.15963v1
- Date: Thu, 19 Jun 2025 02:16:08 GMT
- Title: On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond
- Authors: Jingyi Cui, Qi Zhang, Yifei Wang, Yisen Wang
- Abstract summary: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). SAEs aim to recover complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground-truth monosemantic features from the superposed polysemantic ones.
- Score: 36.107366496809675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). SAEs aim to recover complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground-truth monosemantic features from the superposed polysemantic ones. In this paper, through theoretical analysis, we propose for the first time necessary and sufficient conditions for identifiable SAEs (SAEs that learn unique, ground-truth monosemantic features): 1) extreme sparsity of the ground-truth features, 2) sparse activation of the SAE, and 3) enough hidden dimensions in the SAE. Moreover, when the identifiability conditions are not fully met, we propose a reweighting strategy to improve identifiability. Specifically, following the theoretically suggested weight-selection principle, we prove that the gap between the loss functions of SAE reconstruction and monosemantic feature reconstruction can be narrowed, so that reweighted SAEs reconstruct the ground-truth monosemantic features better than uniformly weighted ones. In experiments, we validate our theoretical findings and show that our weighted SAE significantly improves feature monosemanticity and interpretability.
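As a concrete illustration of the setup described in the abstract, the sketch below shows a minimal SAE with a TopK sparse activation and an optional per-dimension reweighting of the reconstruction loss. It is an assumption-laden sketch, not the authors' method: the class name `TopKSAE`, the helper `weighted_reconstruction_loss`, and the random weights are all hypothetical, and the weighting shown is only a stand-in for the paper's theoretically suggested weight-selection principle.

```python
# Minimal sketch of a sparse autoencoder (SAE) with TopK sparse activation
# and an optionally reweighted reconstruction loss. Illustrative only; the
# weight choice is NOT the paper's weight-selection principle.
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)    # wide hidden layer ("enough hidden dimensions")
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)
        self.k = k                                      # number of active latents ("sparse activation")

    def forward(self, x: torch.Tensor):
        pre = torch.relu(self.encoder(x))
        # Keep only the k largest activations per sample; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z), z


def weighted_reconstruction_loss(x_hat, x, weights=None):
    """Squared reconstruction error, optionally reweighted per input dimension.

    `weights` is a hypothetical stand-in for the theoretically suggested
    weight selection; uniform (None) weights recover the standard SAE loss.
    """
    err = (x_hat - x) ** 2
    if weights is not None:
        err = err * weights
    return err.mean()


if __name__ == "__main__":
    d_model, d_hidden, k = 64, 256, 8
    sae = TopKSAE(d_model, d_hidden, k)
    x = torch.randn(32, d_model)        # stand-in for LLM activations
    x_hat, z = sae(x)
    w = torch.rand(d_model)             # illustrative non-uniform weights
    loss = weighted_reconstruction_loss(x_hat, x, w)
    loss.backward()
    active = (z != 0).float().sum(-1).mean().item()
    print(f"loss={loss.item():.4f}, active latents per sample={active:.1f}")
```

Setting `weights=None` gives the uniformly weighted SAE objective that the paper compares against.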
Related papers
- Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting (a minimal bagging sketch appears after this list). Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, the diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z) - Probing the Vulnerability of Large Language Models to Polysemantic Interventions [49.64902130083662]
We investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small). Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models.
arXiv Detail & Related papers (2025-05-16T18:20:42Z) - Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z) - Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry [11.968306791864034]
We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem. We show that SAEs fail to recover concepts when these properties are ignored. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability.
arXiv Detail & Related papers (2025-03-03T18:47:40Z) - Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words [29.09237503747052]
Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs). In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words.
arXiv Detail & Related papers (2025-01-09T02:54:19Z) - Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z) - Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z) - A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders [0.0]
We show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their child features.
arXiv Detail & Related papers (2024-09-22T16:11:02Z)
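The sketch below illustrates the "naive bagging" idea referenced in the Ensembling Sparse Autoencoders entry above, under a straightforward reading: train several SAEs independently and average their reconstructions. The class and function names (`ReluSAE`, `bagged_reconstruction`) are hypothetical and this is not that paper's exact algorithm.

```python
# Minimal sketch of naive bagging for SAEs: average the reconstructions of
# independently initialised (and, in practice, independently trained) members.
import torch
import torch.nn as nn


class ReluSAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))


def bagged_reconstruction(saes, x):
    # Averaging member reconstructions is the bagging step; each member
    # would normally be trained on its own bootstrap sample of activations.
    return torch.stack([sae(x) for sae in saes]).mean(dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    members = [ReluSAE(64, 256) for _ in range(4)]
    x = torch.randn(32, 64)          # stand-in for language model activations
    x_hat = bagged_reconstruction(members, x)
    print(x_hat.shape)               # torch.Size([32, 64])
```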