AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
- URL: http://arxiv.org/abs/2510.00404v2
- Date: Thu, 02 Oct 2025 17:28:55 GMT
- Title: AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
- Authors: Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu
- Abstract summary: Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models. We introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK.
- Score: 19.58274892471746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) have emerged as powerful techniques for interpretability of large language models (LLMs), aiming to decompose hidden states into meaningful semantic features. While several SAE variants have been proposed, there remains no principled framework to derive SAEs from the original dictionary learning formulation. In this work, we introduce such a framework by unrolling the proximal gradient method for sparse coding. We show that a single-step update naturally recovers common SAE variants, including ReLU, JumpReLU, and TopK. Through this lens, we reveal a fundamental limitation of existing SAEs: their sparsity-inducing regularizers enforce non-negativity, preventing a single feature from representing bidirectional concepts (e.g., male vs. female). This structural constraint fragments semantic axes into separate, redundant features, limiting representational completeness. To address this issue, we propose AbsTopK SAE, a new variant derived from the $\ell_0$ sparsity constraint that applies hard thresholding over the largest-magnitude activations. By preserving both positive and negative activations, AbsTopK uncovers richer, bidirectional conceptual representations. Comprehensive experiments across four LLMs and seven probing and steering tasks show that AbsTopK improves reconstruction fidelity, enhances interpretability, and enables single features to encode contrasting concepts. Remarkably, AbsTopK matches or even surpasses the Difference-in-Mean method, a supervised approach that requires labeled data for each concept and has been shown in prior work to outperform SAEs.
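The contrast the abstract draws between TopK and AbsTopK can be sketched in a few lines. This is an illustrative NumPy implementation based only on the abstract's description (hard thresholding over the largest-magnitude activations, preserving sign), not the authors' released code; the function names are hypothetical:

```python
import numpy as np

def topk(z, k):
    """Standard TopK-style SAE activation: keep the k largest pre-activations,
    apply ReLU so only non-negative features can fire. A feature therefore
    cannot represent the negative direction of a concept axis."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]           # indices of the k largest values
    out[idx] = np.maximum(z[idx], 0.0) # non-negativity constraint
    return out

def abstopk(z, k):
    """AbsTopK activation (hard thresholding from the l0 constraint):
    keep the k entries with the largest magnitude, preserving sign,
    so one feature can encode both poles of a concept (e.g. male vs. female)."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]   # k largest-magnitude entries
    out[idx] = z[idx]                  # signs are preserved
    return out

z = np.array([0.9, -1.5, 0.2, -0.1, 0.7])
print(topk(z, 2))     # -> [0.9 0.  0.  0.  0.7]  (the -1.5 entry is discarded)
print(abstopk(z, 2))  # -> [ 0.9 -1.5  0.   0.   0. ]  (the -1.5 entry survives)
```

The toy input makes the structural difference visible: the strongest activation in magnitude is negative, so TopK drops it entirely while AbsTopK retains it with its sign intact.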
Related papers
- SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation [65.6201974979119]
We propose SemanticVLA, a novel VLA framework that performs Semantic-Hierarchical Sparsification and Enhancement for Efficient Robotic Manipulation. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold.
arXiv Detail & Related papers (2025-11-13T17:24:37Z) - Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by decomposing SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z) - Analysis of Variational Sparse Autoencoders [1.675385127117872]
We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with sampling from learned Gaussian posteriors. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
arXiv Detail & Related papers (2025-09-26T23:09:56Z) - What Makes You Unique? Attribute Prompt Composition for Object Re-Identification [70.67907354506278]
Object Re-IDentification aims to recognize individuals across non-overlapping camera views. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies. We propose an Attribute Prompt Composition framework, which exploits textual semantics to jointly enhance discrimination and generalization.
arXiv Detail & Related papers (2025-09-23T07:03:08Z) - UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. We propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z) - Evaluating Sparse Autoencoders for Monosemantic Representation [7.46972338257749]
A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. This paper provides the first systematic evaluation of SAEs against base models concerning monosemanticity.
arXiv Detail & Related papers (2025-08-20T22:08:01Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z) - Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability [54.420663939897686]
We propose the Attribute-formed Language Bottleneck Model (ALBM) to achieve interpretable image recognition. ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. To further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes.
arXiv Detail & Related papers (2025-03-26T07:59:04Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Interpreting CLIP with Hierarchical Sparse Autoencoders [8.692675181549117]
Matryoshka SAE (MSAE) learns hierarchical representations at multiple granularities simultaneously. MSAE establishes a new state-of-the-art frontier between reconstruction quality and sparsity for CLIP.
arXiv Detail & Related papers (2025-02-27T22:39:13Z) - SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders [7.618223798662929]
We propose SA-DVAE -- Semantic Alignment via Disentangled Variational Autoencoders.
We implement this idea via a pair of modality-specific variational autoencoders coupled with a total correlation penalty.
Experiments show that SA-DVAE produces improved performance over existing methods.
arXiv Detail & Related papers (2024-07-18T12:35:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.