Dense SAE Latents Are Features, Not Bugs
- URL: http://arxiv.org/abs/2506.15679v1
- Date: Wed, 18 Jun 2025 17:59:35 GMT
- Title: Dense SAE Latents Are Features, Not Bugs
- Authors: Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
- Abstract summary: We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
- Score: 75.08462524662072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are \emph{dense}), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs -- suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
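The abstract's two core diagnostics (activation density and antipodal decoder pairs) can be illustrated with a minimal sketch. This is not the paper's code: the decoder matrix `W_dec`, the activation matrix `acts`, and the thresholds are illustrative assumptions, using synthetic data in which one dense latent and one antipodal pair are planted.

```python
import numpy as np

# Hypothetical sketch: flag dense SAE latents and antipodal decoder pairs.
# W_dec (n_latents x d_model) and acts (n_tokens x n_latents) are synthetic
# stand-ins for a trained SAE's decoder and binarized latent activations.

rng = np.random.default_rng(0)
n_latents, d_model, n_tokens = 8, 16, 1000

W_dec = rng.normal(size=(n_latents, d_model))
W_dec[1] = -W_dec[0]  # plant an antipodal pair (latents 0 and 1)

acts = rng.random((n_tokens, n_latents)) < 0.05  # mostly sparse latents
acts[:, 0] = rng.random(n_tokens) < 0.6          # latent 0 fires densely

# 1) Density: fraction of tokens on which each latent is active.
density = acts.mean(axis=0)
dense = np.where(density > 0.5)[0]

# 2) Antipodal pairs: decoder directions with cosine similarity near -1,
#    i.e. two latents jointly reconstructing one residual-stream direction.
unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
cos = unit @ unit.T
pairs = [(i, j) for i in range(n_latents) for j in range(i + 1, n_latents)
         if cos[i, j] < -0.99]

print(dense)   # latents active on >50% of tokens
print(pairs)   # antipodal latent pairs
```

On this synthetic setup, latent 0 is flagged as dense and (0, 1) surfaces as an antipodal pair; the 0.5 density cutoff and the -0.99 cosine threshold are arbitrary choices for illustration.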
Related papers
- From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit [16.996218963146788]
We show that MP-SAE unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time.
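The "sequence of residual-guided steps" described above can be sketched as classic matching pursuit over a unit-norm dictionary, which is the assumption this toy encoder makes; the function name, dictionary `D`, and step count are illustrative, not the MP-SAE paper's API.

```python
import numpy as np

# Minimal matching-pursuit encoder sketch (assumes the encoder behaves
# like classic matching pursuit; all names here are illustrative).

def matching_pursuit(x, D, n_steps):
    """Greedily select atoms of D (n_atoms x d, unit rows) to explain x."""
    residual = x.copy()
    codes = np.zeros(D.shape[0])
    for _ in range(n_steps):              # fewer steps -> sparser code,
        scores = D @ residual             # giving adaptive sparsity
        k = np.argmax(np.abs(scores))     # atom best matching the residual
        codes[k] += scores[k]             # accumulate its coefficient
        residual -= scores[k] * D[k]      # residual-guided update
    return codes, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(32, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 2.0 * D[3] + 0.5 * D[7]               # signal built from two atoms

codes, residual = matching_pursuit(x, D, n_steps=8)
```

By construction the loop maintains the invariant `x = codes @ D + residual`, and each step strictly shrinks the residual norm, which is what makes early stopping (adaptive sparsity at inference time) well-behaved.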
arXiv Detail & Related papers (2025-06-03T17:24:55Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z) - Analyzing (In)Abilities of SAEs via Formal Languages [14.71170261508271]
We train sparse autoencoders on a synthetic testbed of formal languages. We find performance is sensitive to inductive biases of the training pipeline. We argue that causality has to become a central target in SAE training.
arXiv Detail & Related papers (2024-10-15T16:42:13Z) - A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders [0.0]
We show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get "absorbed" into their children features.
arXiv Detail & Related papers (2024-09-22T16:11:02Z) - The Remarkable Robustness of LLMs: Stages of Inference? [5.346230590800585]
We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning.
arXiv Detail & Related papers (2024-06-27T17:57:03Z) - Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z) - Interventional Causal Representation Learning [75.18055152115586]
Causal representation learning seeks to extract high-level latent factors from low-level sensory data.
Can interventional data facilitate causal representation learning?
We show that interventional data often carries geometric signatures of the latent factors' support.
arXiv Detail & Related papers (2022-09-24T04:59:03Z) - Weakly Supervised Representation Learning with Sparse Perturbations [82.39171485023276]
We show that if one has weak supervision from observations generated by sparse perturbations of the latent variables, identification is achievable under unknown continuous latent distributions.
We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
arXiv Detail & Related papers (2022-06-02T15:30:07Z) - Structure-Aware Feature Generation for Zero-Shot Learning [108.76968151682621]
We introduce a novel structure-aware feature generation scheme, termed as SA-GAN, to account for the topological structure in learning both the latent space and the generative networks.
Our method significantly enhances the generalization capability on unseen-classes and consequently improve the classification performance.
arXiv Detail & Related papers (2021-08-16T11:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.