Related papers: Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

URL: http://arxiv.org/abs/2508.09363v1
Date: Tue, 12 Aug 2025 21:45:10 GMT
Title: Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Authors: Charles O'Neill, Mudith Jayasekara, Max Kirkby,
Abstract summary: We show that restricting SAE training to a well-defined domain reallocates capacity to domain-specific features.<n>We find that domain-confined SAEs explain up to 20% more variance, achieve higher loss recovery, and reduce linear residual error.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear ``dark matter'' in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20\% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and human evaluations confirm that learned features align with clinically meaningful concepts (e.g., ``taste sensations'' or ``infectious mononucleosis''), rather than frequent but uninformative tokens. These domain-specific SAEs capture relevant linear structure, leaving a smaller, more purely nonlinear residual. We conclude that domain-confinement mitigates key limitations of broad-domain SAEs, enabling more complete and interpretable latent decompositions, and suggesting the field may need to question ``foundation-model'' scaling for general-purpose SAEs.

Related papers

Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection [62.89126207012712]
Face Presentation Attack Detection (PAD) demands incremental learning to combat spoofing tactics and domains.<n>Privacy regulations forbid retaining past data, necessitating rehearsal-free learning (RF-IL)
arXiv Detail & Related papers (2025-12-22T04:30:11Z)
Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI [1.759752510445115]
A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in the study.<n>For learning global dependencies, Swin Transformer blocks are customized using internal residual pathways.<n>The Residual-SwinCA-Net and existing CNNs/ViTs techniques have been implemented on the publicly available BUSI dataset.
arXiv Detail & Related papers (2025-12-09T04:52:30Z)
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models.<n>SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs.<n>Recent Mixture of Experts (MoE) approaches attempt to address this by SAEs into narrower expert networks with gated activation.<n>We propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
Analysis of Variational Sparse Autoencoders [1.675385127117872]
We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability.<n>We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with sampling from learned Gaussian posteriors.<n>Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
arXiv Detail & Related papers (2025-09-26T23:09:56Z)
Understanding sparse autoencoder scaling in the presence of feature manifolds [5.2924382061650395]
We adapt a capacity-allocation model from the neural scaling literature to understand SAE scaling.<n>We discuss whether SAEs are in a pathological regime in the wild.
arXiv Detail & Related papers (2025-09-02T17:59:50Z)
Teach Old SAEs New Domain Tricks with Boosting [3.3865605512957453]
This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining.<n>We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts.<n>By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics.
arXiv Detail & Related papers (2025-07-17T10:57:49Z)
On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond [36.107366496809675]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs)<n>It aims to recover complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks.<n>Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground truth monosemantic features from the superposed polysemantic ones.
arXiv Detail & Related papers (2025-06-19T02:16:08Z)
Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation.<n>We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z)
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains [92.36624674516553]
Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs)<n>We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education.<n>We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
arXiv Detail & Related papers (2025-03-31T08:22:49Z)
Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization [68.41367635546183]
Single Domain Generalization aims to train models with consistent performance across diverse scenarios using data from a single source.<n>We propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization.
arXiv Detail & Related papers (2025-03-17T18:08:03Z)
Interpreting CLIP with Hierarchical Sparse Autoencoders [8.692675181549117]
Matryoshka SAE (MSAE) learns hierarchical representations at multiple granularities simultaneously.<n>MSAE establishes a new state-of-the-art frontier between reconstruction quality and sparsity for CLIP.
arXiv Detail & Related papers (2025-02-27T22:39:13Z)
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z)
Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage. In training SAEs on LMs of up to 7B parameters, Gated SAEs solve shrinkage, and require half as many firing features to achieve comparable reconstruction fidelity.
arXiv Detail & Related papers (2024-04-24T17:47:22Z)
Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in Neural Networks [66.76034024335833]
We investigate why diverse/ complex features are learned by the backbone, and their brittleness is due to the linear classification head relying primarily on the simplest features. We propose Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits. We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
arXiv Detail & Related papers (2022-10-04T04:01:15Z)
Fuzzy Attention Neural Network to Tackle Discontinuity in Airway Segmentation [67.19443246236048]
Airway segmentation is crucial for the examination, diagnosis, and prognosis of lung diseases. Some small-sized airway branches (e.g., bronchus and terminaloles) significantly aggravate the difficulty of automatic segmentation. This paper presents an efficient method for airway segmentation, comprising a novel fuzzy attention neural network and a comprehensive loss function.
arXiv Detail & Related papers (2022-09-05T16:38:13Z)
Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.