OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
- URL: http://arxiv.org/abs/2509.22033v1
- Date: Fri, 26 Sep 2025 08:10:52 GMT
- Title: OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
- Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
- Abstract summary: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that mitigates these issues by enforcing orthogonality between the learned features. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance on other downstream tasks compared to traditional SAEs.
- Score: 10.871959954490217
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that mitigates these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance on other downstream tasks compared to traditional SAEs.
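The abstract describes the core training change only at a high level: a penalty on high pairwise cosine similarity between SAE features, computed so that cost scales roughly linearly with SAE size. The paper's exact formulation is not given here, so the PyTorch sketch below is one plausible reading, not the authors' implementation; the hinge threshold `tau`, the per-step random feature subset (used to keep the pairwise comparison cost bounded rather than quadratic in the dictionary size), and all names are illustrative assumptions.

```python
# Hedged sketch of an orthogonality penalty on SAE decoder features.
# Assumptions (not from the paper): the hinge threshold `tau`, the random
# per-step subsampling of features, and every name below.
import torch
import torch.nn.functional as F

def orthogonality_penalty(W_dec: torch.Tensor,
                          n_sample: int = 4096,
                          tau: float = 0.1) -> torch.Tensor:
    """Penalize high pairwise cosine similarity between feature directions.

    W_dec: (d_model, n_features) decoder dictionary; each column is one
           SAE feature direction.
    """
    n_features = W_dec.shape[1]
    # Compare only a random subset of features each step, so the cost of
    # this term stays bounded as the SAE dictionary grows.
    idx = torch.randperm(n_features, device=W_dec.device)[:n_sample]
    W = F.normalize(W_dec[:, idx], dim=0)      # unit-norm feature directions
    sims = W.T @ W                             # pairwise cosine similarities
    sims.fill_diagonal_(0.0)                   # ignore self-similarity
    return F.relu(sims.abs() - tau).mean()     # penalize only |cos| above tau
```

In training, a term like this would simply be added to the usual SAE objective, e.g. `loss = recon_loss + sparsity_loss + lambda_ort * orthogonality_penalty(sae.W_dec)`, with `lambda_ort` trading off reconstruction quality against feature disentanglement.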
Related papers
- SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs [0.9121032932730987]
We introduce SCALAR, a benchmark measuring interaction sparsity between SAE features. We compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
arXiv Detail & Related papers (2025-11-10T19:31:54Z)
- Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by decomposing SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
- Understanding sparse autoencoder scaling in the presence of feature manifolds [5.2924382061650395]
We adapt a capacity-allocation model from the neural scaling literature to understand SAE scaling. We discuss whether SAEs are in a pathological regime in the wild.
arXiv Detail & Related papers (2025-09-02T17:59:50Z)
- Dense SAE Latents Are Features, Not Bugs [75.08462524662072]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z)
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
- Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [6.610766275883306]
Sparse autoencoders (SAEs) are assumed to decompose polysemantic activations into interpretable linear directions. We find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss and is more severe the narrower the SAE (a toy illustration of this merging appears after this list).
arXiv Detail & Related papers (2025-05-16T23:30:17Z)
- Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning [83.10178754323955]
A Hybrid Discriminative Attribute-Object Embedding (HDA-OE) network is proposed to address the complex interactions between attribute and object visual representations. To increase the variability of training data, HDA-OE introduces an attribute-driven data synthesis (ADDS) module. To further improve the discriminative ability of the model, HDA-OE introduces a subclass-driven discriminative embedding (SDDE) module. The proposed model has been evaluated on three benchmark datasets, and the results verify its effectiveness and reliability.
arXiv Detail & Related papers (2024-11-28T09:50:25Z)
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z)
- Efficient Dictionary Learning with Switch Sparse Autoencoders [8.577217344304072]
We introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs. We find that Switch SAEs deliver a substantial improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget.
arXiv Detail & Related papers (2024-10-10T17:59:11Z)
- Adaptive Feature Selection for No-Reference Image Quality Assessment by Mitigating Semantic Noise Sensitivity [55.399230250413986]
We propose a Quality-Aware Feature Matching IQA Metric (QFM-IQM) to remove harmful semantic noise features from the upstream task. Our approach achieves superior performance to the state-of-the-art NR-IQA methods on eight standard IQA datasets.
arXiv Detail & Related papers (2023-12-11T06:50:27Z)
- Spatio-temporal Gait Feature with Adaptive Distance Alignment [90.5842782685509]
We try to increase the difference between gait features of different subjects from two aspects: the optimization of the network structure and the refinement of the extracted gait features. Our proposed method consists of Spatio-temporal Feature Extraction (SFE) and Adaptive Distance Alignment (ADA). ADA uses a large amount of unlabeled real-life gait data as a benchmark to refine the extracted spatio-temporal features so that they have low inter-class similarity and high intra-class similarity.
arXiv Detail & Related papers (2022-03-07T13:34:00Z)
- Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computationally efficient yet accurate network based on the proposed attentive auxiliary features (A²F) for SISR.
Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
- Improving Aspect-Level Sentiment Analysis with Aspect Extraction [104.3459510527776]
The work primarily hypothesizes that transferring knowledge from a pre-trained AE model can benefit the performance of ALSA models.
Empirically, this work shows that the added information significantly improves the performance of three different baseline ALSA models.
arXiv Detail & Related papers (2020-05-03T06:25:16Z)
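The feature-hedging entry above describes a mechanism concrete enough to demonstrate in a toy model. The NumPy sketch below is an illustrative assumption, not the paper's experiment: it stands in for a narrow SAE with a width-1 linear autoencoder, whose reconstruction-optimal direction is the top principal component of the data, and shows that when two ground-truth features are correlated, the single learned direction blends them instead of matching either one.

```python
# Toy illustration of feature hedging. Illustrative assumptions throughout:
# the data model, the correlation level, and the linear-autoencoder stand-in.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
f1 = rng.binomial(1, 0.5, n).astype(float)         # ground-truth feature 1
f2 = np.where(rng.random(n) < 0.8, f1, 1.0 - f1)   # feature 2, correlated with f1
# Each true feature has its own orthogonal direction in a 2-d activation space.
X = np.outer(f1, [1.0, 0.0]) + np.outer(f2, [0.0, 1.0])

# A width-1 linear autoencoder minimizing reconstruction error learns the top
# principal direction of the data, which mixes both correlated features.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print("learned direction:", np.round(Vt[0], 2))    # ~[0.71, 0.71] up to sign
```

The blended `[0.71, 0.71]` direction is the hedged mixture the entry describes: neither true feature is represented cleanly by the single learned latent.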
This list is automatically generated from the titles and abstracts of the papers in this site.