Understanding sparse autoencoder scaling in the presence of feature manifolds
- URL: http://arxiv.org/abs/2509.02565v2
- Date: Thu, 04 Sep 2025 17:55:36 GMT
- Title: Understanding sparse autoencoder scaling in the presence of feature manifolds
- Authors: Eric J. Michaud, Liv Gorton, Tom McGrath,
- Abstract summary: We adapt a capacity-allocation model from the neural scaling literature to understand SAE scaling. We discuss whether SAEs are in a pathological regime in the wild.
- Score: 5.2924382061650395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) model the activations of a neural network as linear combinations of sparsely occurring directions of variation (latents). The ability of SAEs to reconstruct activations follows scaling laws with respect to the number of latents. In this work, we adapt a capacity-allocation model from the neural scaling literature (Brill, 2024) to understand SAE scaling, and in particular, to understand how "feature manifolds" (multi-dimensional features) influence scaling behavior. Consistent with prior work, the model recovers distinct scaling regimes. Notably, in one regime, feature manifolds have the pathological effect of causing SAEs to learn far fewer features of the data than there are latents in the SAE. We provide some preliminary discussion of whether SAEs are in this pathological regime in the wild.
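For concreteness, here is a minimal sketch of the setup the abstract describes: activations are encoded into sparse, non-negative latent coefficients and reconstructed as a linear combination of learned decoder directions. The architecture details (ReLU encoder, L1 sparsity penalty) and all names and sizes are illustrative assumptions, not specifics from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct activations as sparse linear
    combinations of learned latent directions (decoder weights)."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse, non-negative latent activations
        x_hat = self.decoder(z)       # linear reconstruction from active latents
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity;
    # the penalty form and coefficient are assumptions for illustration.
    return F.mse_loss(x_hat, x) + l1_coef * z.abs().sum(dim=-1).mean()
```

A scaling study of the kind the paper analyzes would then sweep `n_latents` and track how reconstruction loss falls as the SAE grows.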
Related papers
- Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces [75.45093712182624]
We introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NOs). We compare the inference and training dynamics of SAEs, lifted SAEs, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
arXiv Detail & Related papers (2025-09-03T21:57:03Z) - Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z) - Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, the diversity of features, and SAE stability (a minimal bagging sketch appears after this list).
arXiv Detail & Related papers (2025-05-21T23:31:21Z) - Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks under $μ$P parametrization using the tensor program framework. We show that stochastic gradient descent (SGD) enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z) - Emergent Symbol-like Number Variables in Artificial Neural Networks [34.388552536773034]
We show that we can interpret raw NN activity through the lens of simplified Symbolic Algorithms (SAs). We extend the DAS framework to a broader class of alignment functions that more flexibly capture NN activity in terms of interpretable variables from SAs. We show that recurrent models can develop graded, symbol-like number variables in their neural activity.
arXiv Detail & Related papers (2025-01-10T18:03:46Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space. We build an open-source pipeline to generate and evaluate natural language explanations for SAE features. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Can sparse autoencoders make sense of gene expression latent variable models? [0.0]
This work explores the potential of SAEs for decomposing embeddings in complex and high-dimensional biological data. The application to embeddings from pretrained single-cell models shows that SAEs can find and steer key biological processes. scFeatureLens is an automated interpretability approach for linking SAE features and biological concepts from gene sets.
arXiv Detail & Related papers (2024-10-15T10:16:01Z) - Mechanistic Permutability: Match Features Across Layers [4.2056926734482065]
We introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies (see the matching sketch after this list).
arXiv Detail & Related papers (2024-10-10T06:55:38Z) - Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
Self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) found in high-precision solutions of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NNs).
We show that SAM acts as a stiffness-aware step-size adaptor that can enhance the model's representational ability by measuring intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
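Two of the entries above describe mechanics concrete enough to sketch. First, for "Ensembling Sparse Autoencoders", a minimal reading of naive bagging is to average the reconstructions of several independently trained SAEs; the function below assumes SAE modules like the `SparseAutoencoder` sketched earlier, and is an illustrative guess at the setup rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def bagged_reconstruction(saes, x: torch.Tensor) -> torch.Tensor:
    # Naive bagging (assumed reading): average the reconstructions of
    # several independently trained SAEs on the same activations.
    recons = torch.stack([sae(x)[0] for sae in saes])  # (n_saes, batch, d_model)
    return recons.mean(dim=0)
```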
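Second, for "SAE Match", one data-free way to align latents across layers, assumed here purely for illustration, is to pair decoder directions by cosine similarity under an optimal one-to-one assignment; the paper's actual procedure (e.g. its treatment of SAE parameters) may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(W_dec_a: np.ndarray, W_dec_b: np.ndarray) -> np.ndarray:
    """Pair each latent of SAE A with one latent of SAE B by maximizing
    total cosine similarity between decoder directions; inputs have
    shape (n_latents, d_model). Returns col with latent i in A matched
    to latent col[i] in B."""
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    sim = a @ b.T                           # pairwise cosine similarities
    _, col = linear_sum_assignment(-sim)    # Hungarian assignment, maximizing sim
    return col
```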