Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
- URL: http://arxiv.org/abs/2602.14111v1
- Date: Sun, 15 Feb 2026 11:53:55 GMT
- Title: Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
- Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
- Abstract summary: Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features.
- Score: 10.871959954490217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
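The central comparison in the abstract, fully trained SAEs versus baselines whose feature directions are fixed to random values, can be illustrated with a short sketch. The code below is not the authors' implementation; the TopK architecture, the dimensions, and the `freeze_random_decoder` helper are assumptions chosen for exposition of what "constraining feature directions to random values" could mean in practice.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: encode, keep the k largest
    pre-activations, reconstruct with a linear decoder."""

    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)                                  # (batch, n_features)
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(acts), acts                        # reconstruction, sparse codes


def freeze_random_decoder(sae: TopKSAE, seed: int = 0) -> TopKSAE:
    """Baseline in the spirit of the paper: replace the learned feature
    directions (decoder columns) with random unit-norm vectors and freeze
    them, so only the encoder's activation pattern can adapt."""
    g = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        rand = torch.randn(sae.decoder.weight.shape, generator=g)
        sae.decoder.weight.copy_(rand / rand.norm(dim=0, keepdim=True))
    sae.decoder.weight.requires_grad_(False)
    return sae
```

Training both variants on the same reconstruction objective and then scoring them on interpretability, sparse probing, and causal editing is the kind of like-for-like comparison the paper's baselines are meant to enable: if the frozen-random variant scores comparably, the learned directions carry less explanatory weight than commonly assumed.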
Related papers
- SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs [0.9121032932730987]
We introduce SCALAR, a benchmark measuring interaction sparsity between SAE features. We compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
arXiv Detail & Related papers (2025-11-10T19:31:54Z)
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress-testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
- Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility. Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution.
arXiv Detail & Related papers (2025-10-04T04:14:50Z)
- Dense SAE Latents Are Features, Not Bugs [86.50389855919292]
We show that dense latents serve functional roles in language model computation. We identify classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction.
arXiv Detail & Related papers (2025-06-18T17:59:35Z)
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, the diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
- Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [6.610766275883306]
It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions. We find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by the SAE reconstruction loss and is more severe the narrower the SAE.
arXiv Detail & Related papers (2025-05-16T23:30:17Z)
- Are Sparse Autoencoders Useful? A Case Study in Sparse Probing [6.836374436707495]
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes.
arXiv Detail & Related papers (2025-02-23T18:54:15Z)
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [73.37603699731329]
We introduce AxBench, a large-scale benchmark for steering and concept detection. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best.
arXiv Detail & Related papers (2025-01-28T18:51:24Z)
- Sparse Autoencoders Trained on the Same Data Learn Different Features [0.7234862895932991]
Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in large language models. Our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features.
arXiv Detail & Related papers (2025-01-28T01:24:16Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis [71.40390724765903]
Aspect-based sentiment analysis (ABSA) aims to predict the sentiment towards a specific aspect in the text.
Existing ABSA test sets cannot be used to probe whether a model can distinguish the sentiment of the target aspect from the non-target aspects.
We generate new examples to disentangle the confounding sentiments of the non-target aspects from the target aspect's sentiment.
arXiv Detail & Related papers (2020-09-16T22:38:18Z)
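The abstract's sparse probing comparison (0.69 vs 0.72), and the "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing" entry above, rest on a protocol of fitting a small probe on a few SAE feature activations and reporting held-out accuracy. The sketch below is a generic illustration of that idea with scikit-learn; the function name, the difference-in-means feature selection, and the train/test split are assumptions, not the evaluation code of either paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def sparse_probe_score(feature_acts: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Probe a binary concept (labels in {0, 1}) from the k SAE features
    whose mean activation differs most between the two classes."""
    mu_pos = feature_acts[labels == 1].mean(axis=0)
    mu_neg = feature_acts[labels == 0].mean(axis=0)
    top_features = np.argsort(-np.abs(mu_pos - mu_neg))[:k]

    X_train, X_test, y_train, y_test = train_test_split(
        feature_acts[:, top_features], labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)  # held-out probing accuracy
```

Running the same routine on feature activations from a fully trained SAE and from a random baseline gives a direct, like-for-like version of the comparison summarized in the abstract.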
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.