Ensembling Sparse Autoencoders
- URL: http://arxiv.org/abs/2505.16077v1
- Date: Wed, 21 May 2025 23:31:21 GMT
- Title: Ensembling Sparse Autoencoders
- Authors: Soham Gadgil, Chris Lin, Su-In Lee
- Abstract summary: Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
- Score: 10.81463830315253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs trained with different initial weights can learn different features, demonstrating that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we propose to ensemble multiple SAEs through naive bagging and boosting. Specifically, SAEs trained with different weight initializations are ensembled in naive bagging, whereas SAEs sequentially trained to minimize the residual error are ensembled in boosting. We evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability. Furthermore, ensembling SAEs performs better than applying a single SAE on downstream tasks such as concept detection and spurious correlation removal, showing improved practical utility.
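The abstract describes the two ensembling strategies only at a high level. The sketch below illustrates them in PyTorch under assumed details: a plain ReLU SAE with an L1 sparsity penalty, full-batch training, and illustrative hyperparameters. The names `SparseAutoencoder`, `train_sae`, `naive_bagging`, and `boosting` are hypothetical and not taken from the paper's code.
```python
# Minimal sketch (not the authors' implementation) of naive bagging and boosting of SAEs.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Plain ReLU SAE with untied encoder/decoder weights (assumed architecture)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(z), z


def train_sae(sae, acts, l1_coef=1e-3, lr=1e-3, steps=1000):
    """Reconstruction + L1 sparsity objective (illustrative, full-batch)."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, z = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1_coef * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def naive_bagging(acts, n_saes, d_model, d_hidden):
    """Naive bagging: SAEs trained from different random initializations on the
    same activations; the ensemble pools (concatenates) their learned features."""
    saes = []
    for seed in range(n_saes):
        torch.manual_seed(seed)  # only the weight initialization differs
        saes.append(train_sae(SparseAutoencoder(d_model, d_hidden), acts))
    return saes


def boosting(acts, n_saes, d_model, d_hidden):
    """Boosting: SAEs trained sequentially, each on the residual reconstruction
    error left by the SAEs trained before it."""
    saes, residual = [], acts.clone()
    for _ in range(n_saes):
        sae = train_sae(SparseAutoencoder(d_model, d_hidden), residual)
        with torch.no_grad():
            recon, _ = sae(residual)
        residual = residual - recon  # the next SAE models what is still unexplained
        saes.append(sae)
    return saes
```
Under these assumptions, a bagging ensemble would aggregate by averaging member reconstructions (its feature set being the union of the members' dictionaries), while a boosting ensemble would sum the members' residual fits; the paper's exact aggregation rules may differ.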
Related papers
- TopK Language Models [23.574227495324568]
TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts.
arXiv Detail & Related papers (2025-06-26T16:56:43Z) - Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Boosting All-in-One Image Restoration via Self-Improved Privilege Learning [72.35265021054471]
Self-Improved Privilege Learning (SIPL) is a novel paradigm that overcomes limitations by extending the utility of privileged information (PI) beyond training into the inference stage. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks.
arXiv Detail & Related papers (2025-05-30T04:36:52Z) - Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [1.0582505915332336]
It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions. We find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE.
arXiv Detail & Related papers (2025-05-16T23:30:17Z) - Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders [41.1110443501488]
We introduce a novel metric to assess the monolinguality of features obtained from SAEs. We show that ablating these SAE features significantly reduces an LLM's abilities in only one language, leaving others almost unaffected. We leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.
arXiv Detail & Related papers (2025-05-08T10:24:44Z) - Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
Sparse Autoencoders (SAEs) have been shown to enhance interpretability and steerability in Large Language Models (LLMs). In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity in vision representations.
arXiv Detail & Related papers (2025-04-03T17:58:35Z) - Sparse Autoencoders Trained on the Same Data Learn Different Features [0.7234862895932991]
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in large language models. Our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features (see the feature-overlap sketch after this list).
arXiv Detail & Related papers (2025-01-28T01:24:16Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space. We build an open-source pipeline to generate and evaluate natural language explanations for SAE features. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z) - Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs [47.14410674505256]
We present a case study of syntax acquisition in masked language models (MLMs). We study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities.
arXiv Detail & Related papers (2023-09-13T20:57:11Z)
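The seed-sensitivity finding above ("Sparse Autoencoders Trained on the Same Data Learn Different Features") is what motivates ensembling in the main paper. A common way to quantify it, sketched below as an assumed metric rather than either paper's exact procedure, is to match each decoder direction of one SAE to its most similar direction in the other by cosine similarity.
```python
# Assumed metric (not necessarily either paper's): fraction of SAE A's features that
# have a close counterpart in SAE B, measured by best-match cosine similarity
# between unit-normalized decoder directions (one decoder column per feature).
import torch
import torch.nn.functional as F


def feature_overlap(decoder_a: torch.Tensor, decoder_b: torch.Tensor,
                    threshold: float = 0.9) -> float:
    """decoder_a, decoder_b: (d_model, d_hidden) decoder weight matrices."""
    a = F.normalize(decoder_a, dim=0)   # unit-norm feature directions of SAE A
    b = F.normalize(decoder_b, dim=0)   # unit-norm feature directions of SAE B
    sims = a.T @ b                      # (d_hidden_a, d_hidden_b) cosine similarities
    best = sims.max(dim=1).values       # best match in B for each feature of A
    return (best > threshold).float().mean().item()
```
A low overlap at a strict threshold indicates the two SAEs learned largely different dictionaries, which is the regime in which the main paper argues ensembling should help.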
This list is automatically generated from the titles and abstracts of the papers on this site.