TopK Language Models
- URL: http://arxiv.org/abs/2506.21468v1
- Date: Thu, 26 Jun 2025 16:56:43 GMT
- Title: TopK Language Models
- Authors: Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
- Abstract summary: TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts.
- Score: 23.574227495324568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear whether the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
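The architectural change described in the abstract is small enough to sketch. Below is a minimal, illustrative PyTorch implementation of a TopK activation applied to a layer's hidden states, together with a toy neuron intervention; the class names (TopKActivation, SparsifiedBlock), the insertion point, and the steer helper are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn


class TopKActivation(nn.Module):
    """Keep only the k largest entries of each hidden-state vector, zeroing the rest."""

    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        _, topk_idx = torch.topk(x, self.k, dim=-1)
        mask = torch.zeros_like(x).scatter_(-1, topk_idx, 1.0)
        return x * mask  # sparse hidden state, analogous to TopK SAE latents


class SparsifiedBlock(nn.Module):
    """Hypothetical wrapper: applies TopK to the output of an existing transformer block."""

    def __init__(self, block: nn.Module, k: int):
        super().__init__()
        self.block = block
        self.topk = TopKActivation(k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.topk(self.block(x))


def steer(hidden: torch.Tensor, neuron: int, value: float) -> torch.Tensor:
    """Toy neuron intervention: clamp one coordinate of the sparse hidden state."""
    steered = hidden.clone()
    steered[..., neuron] = value
    return steered
```

Because only k coordinates per token remain nonzero after the TopK operation, each surviving coordinate plays the role of a TopK SAE latent, which is what makes targeted neuron steering and cross-checkpoint comparison of features tractable.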
Related papers
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Model Unlearning via Sparse Autoencoder Subspace Guided Projections [34.47648738350138]
Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space.
arXiv Detail & Related papers (2025-05-30T10:07:52Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z) - Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [1.0582505915332336]
It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions. We find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE.
arXiv Detail & Related papers (2025-05-16T23:30:17Z) - Steering CLIP's vision transformer with sparse autoencoders [20.63298721008492]
We train sparse autoencoders (SAEs) on CLIP's vision transformer to uncover key differences between vision and language processing. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model.
arXiv Detail & Related papers (2025-04-11T17:56:09Z) - Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models [50.587868616659826]
We introduce a comprehensive framework for evaluating monosemanticity at the neuron level in vision representations. Our experimental results reveal that SAEs trained on Vision-Language Models significantly enhance the monosemanticity of individual neurons.
arXiv Detail & Related papers (2025-04-03T17:58:35Z) - Understanding the Role of Equivariance in Self-supervised Learning [51.56331245499712]
Equivariant self-supervised learning (E-SSL) learns features to be augmentation-aware.
We identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks.
We reveal several principles for practical designs of E-SSL.
arXiv Detail & Related papers (2024-11-10T16:09:47Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training [105.02614392553198]
We propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST).
RoAST incorporates two important sources of model robustness: robustness to perturbed inputs and generalizable knowledge in pre-trained LMs.
We demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs.
arXiv Detail & Related papers (2023-12-07T04:23:36Z) - Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs [47.14410674505256]
We present a case study of syntax acquisition in masked language models (MLMs). We study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities.
arXiv Detail & Related papers (2023-09-13T20:57:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.