SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
- URL: http://arxiv.org/abs/2511.07572v1
- Date: Wed, 12 Nov 2025 01:04:20 GMT
- Title: SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
- Authors: Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, Stefan Heimersheim
- Abstract summary: We introduce SCALAR, a benchmark measuring interaction sparsity between SAE features. We compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
- Score: 0.9121032932730987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic interpretability aims to decompose neural networks into interpretable features and map their connecting circuits. The standard approach trains sparse autoencoders (SAEs) on each layer's activations. However, SAEs trained in isolation don't encourage sparse cross-layer connections, inflating extracted circuits where upstream features needlessly affect multiple downstream features. Current evaluations focus on individual SAE performance, leaving interaction sparsity unexamined. We introduce SCALAR (Sparse Connectivity Assessment of Latent Activation Relationships), a benchmark measuring interaction sparsity between SAE features. We also propose "Staircase SAEs", using weight-sharing to limit upstream feature duplication across downstream features. Using SCALAR, we compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Staircase SAEs improve relative sparsity over TopK SAEs by $59.67\% \pm 1.83\%$ (feedforward) and $63.15\% \pm 1.35\%$ (transformer blocks). JSAEs provide $8.54\% \pm 0.38\%$ improvement over TopK for feedforward layers but cannot train effectively across transformer blocks, unlike Staircase and TopK SAEs which work anywhere in the residual stream. We validate on a 216K-parameter toy model and GPT-2 Small (124M), where Staircase SAEs maintain interaction sparsity improvements while preserving feature interpretability. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
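The central quantity is how sparsely upstream SAE features influence downstream SAE features. Below is a minimal sketch of that kind of measurement, assuming a toy TopK SAE pair wrapped around a stand-in layer; all module names, shapes, and the near-zero threshold are illustrative assumptions, not the paper's actual SCALAR implementation.

```python
# Toy sketch: measure feature-to-feature interaction sparsity between an
# upstream and a downstream TopK SAE via the Jacobian of downstream feature
# activations with respect to upstream feature activations.
import torch
import torch.nn as nn

d_model, n_feat = 32, 128

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder (illustrative, untrained)."""
    def __init__(self, d_model, n_feat, k=8):
        super().__init__()
        self.enc = nn.Linear(d_model, n_feat)
        self.dec = nn.Linear(n_feat, d_model)
        self.k = k

    def encode(self, x):
        a = self.enc(x)
        topk = torch.topk(a, self.k, dim=-1)
        mask = torch.zeros_like(a).scatter(-1, topk.indices, 1.0)
        return a * mask  # keep only the k largest pre-activations

    def forward(self, x):
        return self.dec(self.encode(x))

sae_up, sae_down = TopKSAE(d_model, n_feat), TopKSAE(d_model, n_feat)
layer = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))  # stand-in for a model layer

def downstream_feats(f_up):
    # upstream features -> activations -> next layer -> downstream features
    return sae_down.encode(layer(sae_up.dec(f_up)))

x = torch.randn(d_model)
f_up = sae_up.encode(x)
J = torch.autograd.functional.jacobian(downstream_feats, f_up)
# Fraction of (near-)zero feature-to-feature interactions: higher is sparser.
sparsity = (J.abs() < 1e-6).float().mean().item()
print(f"interaction sparsity: {sparsity:.3f}")
```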
Related papers
- Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? [10.871959954490217]
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features.
arXiv Detail & Related papers (2026-02-15T11:53:55Z)
- Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders [63.544453925182005]
We train 90 SAEs across three language models and evaluate their interpretability and steering utility. Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We propose a novel selection criterion called Delta Token Confidence, which measures how much amplifying a feature changes the next-token distribution.
arXiv Detail & Related papers (2025-10-04T04:14:50Z)
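As a rough illustration of the Delta Token Confidence idea in the paper above: amplify one SAE feature direction in a hidden state and measure how the next-token distribution moves. The toy unembedding, decoder, and the particular "change in top-token probability" readout are assumptions for the sketch, not the paper's exact criterion.

```python
# Toy sketch: how much does amplifying one SAE feature shift next-token probs?
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, vocab, n_feat = 16, 50, 64
unembed = nn.Linear(d_model, vocab, bias=False)  # stand-in LM head
decoder = nn.Linear(n_feat, d_model, bias=False) # SAE decoder directions

h = torch.randn(d_model)            # hidden state at some position
feature_id, alpha = 3, 5.0          # feature to amplify, amplification strength

def next_token_probs(hidden):
    return torch.softmax(unembed(hidden), dim=-1)

p_base = next_token_probs(h)
h_amp = h + alpha * decoder.weight[:, feature_id]  # add the feature direction
p_amp = next_token_probs(h_amp)

# One possible "delta confidence": change in probability of the base top token.
top = p_base.argmax()
delta = (p_amp[top] - p_base[top]).item()
print(f"delta confidence for feature {feature_id}: {delta:+.4f}")
```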
- OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features [10.871959954490217]
Sparse autoencoders (SAEs) are a technique for sparse decomposition of neural network activations into human-interpretable features. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed at mitigating these issues by enforcing orthogonality between the learned features. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance on other downstream tasks compared to traditional SAEs.
arXiv Detail & Related papers (2025-09-26T08:10:52Z)
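One way to realize the orthogonality constraint described above is a penalty on pairwise cosine similarities between decoder directions. The following is a minimal sketch in that spirit; OrtSAE's actual loss may differ.

```python
# Toy sketch: penalize non-orthogonal SAE decoder directions.
import torch

def orthogonality_penalty(W_dec: torch.Tensor) -> torch.Tensor:
    """W_dec: (n_features, d_model); rows are feature directions."""
    W = torch.nn.functional.normalize(W_dec, dim=-1)
    gram = W @ W.T                                   # pairwise cosine sims
    off_diag = gram - torch.eye(W.shape[0], device=W.device)
    return (off_diag ** 2).mean()                    # push features apart

W = torch.randn(128, 32, requires_grad=True)
loss = orthogonality_penalty(W)
loss.backward()  # would be added to the usual reconstruction + sparsity loss
```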
- Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. We propose to ensemble multiple SAEs through naive bagging and boosting. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, the diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
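For the ensembling paper above, naive bagging admits a very small sketch: several independently trained SAEs whose reconstructions are averaged. The toy ReLU SAE below is illustrative, not the paper's architecture.

```python
# Toy sketch: bag several SAEs by averaging their reconstructions.
import torch
import torch.nn as nn

class ReluSAE(nn.Module):
    def __init__(self, d_model=32, n_feat=128):
        super().__init__()
        self.enc = nn.Linear(d_model, n_feat)
        self.dec = nn.Linear(n_feat, d_model)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

saes = [ReluSAE() for _ in range(5)]  # in practice: separately trained members
x = torch.randn(8, 32)                # batch of activations
recon = torch.stack([sae(x) for sae in saes]).mean(dim=0)  # bagged output
```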
- Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders [6.610766275883306]
It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions. We find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together. This phenomenon, which we call feature hedging, is caused by the SAE reconstruction loss, and is more severe the narrower the SAE.
arXiv Detail & Related papers (2025-05-16T23:30:17Z)
- Low-Rank Adapting Models for Sparse Autoencoders [6.932760557251821]
We use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. Our method reduces the cross-entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass.
arXiv Detail & Related papers (2025-01-31T18:59:16Z)
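A hedged sketch of the setup described above: freeze a base layer, learn a low-rank (LoRA) update on it, and splice a previously trained SAE into the forward pass so the model adapts around it. The wiring and module names are assumptions, not the paper's implementation.

```python
# Toy sketch: LoRA-adapt a frozen layer with an SAE spliced in afterwards.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen base output plus trainable low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

d = 32
mlp_out = LoRALinear(nn.Linear(d, d))
sae = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))  # stand-in SAE

x = torch.randn(8, d)
h = sae(mlp_out(x))  # SAE inserted after the adapted layer in the forward pass
```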
- μP$^2$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling [49.25546155981064]
We study the infinite-width limit of neural networks trained with Sharpness Aware Minimization (SAM). Our findings reveal that the dynamics of standard SAM effectively reduce to applying SAM solely in the last layer in wide neural networks. In contrast, we identify a stable parameterization with layerwise perturbation scaling, which we call $\textit{Maximal Update and Perturbation Parametrization}$ ($\mu$P$^2$), that ensures all layers are both feature learning and effectively perturbed in the limit.
arXiv Detail & Related papers (2024-10-31T16:32:04Z)
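To make "layerwise perturbation scaling" concrete, here is a minimal SAM step that accepts a per-parameter perturbation radius. The actual $\mu$P$^2$ scaling rules are width-dependent and more specific than this sketch assumes; `layer_rho` and the `loss_fn(model, data)` signature are illustrative.

```python
# Toy sketch: one SAM update with layerwise perturbation radii.
import torch
import torch.nn as nn

def sam_step(model, loss_fn, data, opt, layer_rho):
    """layer_rho: dict mapping parameter name -> perturbation radius."""
    loss_fn(model, data).backward()
    eps = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            e = layer_rho[name] * p.grad / (p.grad.norm() + 1e-12)
            p.add_(e)                 # ascend to the nearby high-loss point
            eps[name] = e
    model.zero_grad()
    loss_fn(model, data).backward()   # gradient at the perturbed weights
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in eps:
                p.sub_(eps[name])     # undo the perturbation
    opt.step()
    opt.zero_grad()

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
rho = {name: 0.05 for name, _ in model.named_parameters()}  # per-layer radii
data = (torch.randn(16, 4), torch.randn(16, 1))
sam_step(model, lambda m, d: nn.functional.mse_loss(m(d[0]), d[1]), data, opt, rho)
```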
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
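The routing idea in the next entry can be sketched as hard top-1 routing between small expert SAEs, mixture-of-experts style; the architecture details below are assumptions, not the paper's exact design.

```python
# Toy sketch: route each activation vector to one of several expert SAEs.
import torch
import torch.nn as nn

class SwitchSAE(nn.Module):
    def __init__(self, d_model=32, n_feat=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, n_feat), nn.ReLU(),
                          nn.Linear(n_feat, d_model))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (batch, d_model)
        choice = self.router(x).argmax(dim=-1)  # hard top-1 routing
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            sel = choice == i
            if sel.any():
                # each input is reconstructed by exactly one expert SAE
                out[sel] = expert(x[sel])
        return out

recon = SwitchSAE()(torch.randn(8, 32))
```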
- Efficient Dictionary Learning with Switch Sparse Autoencoders [8.577217344304072]
We introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture-of-experts models, Switch SAEs route activation vectors between smaller "expert" SAEs. We find that Switch SAEs deliver a substantial improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget.
arXiv Detail & Related papers (2024-10-10T17:59:11Z)
- Correlation-Embedded Transformer Tracking: A Single-Branch Framework [69.0798277313574]
We propose a novel single-branch tracking framework inspired by the transformer.
Unlike the Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network.
The output features can be directly used for predicting target locations without additional correlation steps.
arXiv Detail & Related papers (2024-01-23T13:20:57Z)
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex and non-convex loss landscapes. Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing the maximized change of training loss when adding a perturbation to the weights. In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves sparse perturbation by a binary mask.
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
- Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach [132.37966970098645]
One of the popular solutions is Sharpness-Aware Minimization (SAM), which minimizes the change of loss when adding a perturbation to the weights, at the cost of roughly doubling the overhead of common optimizers. In this paper, we propose an efficient and effective training scheme coined Sparse SAM (SSAM), which achieves sparse perturbation by a binary mask. In addition, we theoretically prove that SSAM can converge at the same rate as SAM, i.e., $O(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2022-10-11T06:30:10Z)
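Both SSAM papers above share one mechanism: restrict the SAM ascent step to a sparse, binary-masked subset of weights. A sketch follows; the top-magnitude-gradient mask is just one plausible heuristic, not necessarily the papers' mask-selection rule.

```python
# Toy sketch: a sparse, binary-masked SAM perturbation.
import torch

def sparse_perturbation(param: torch.Tensor, rho=0.05, sparsity=0.5):
    """Return a perturbation touching only the largest-gradient entries."""
    g = param.grad
    k = int(g.numel() * (1.0 - sparsity))           # number of entries to keep
    thresh = g.abs().flatten().kthvalue(g.numel() - k + 1).values
    mask = (g.abs() >= thresh).float()              # binary mask
    return rho * mask * g / (g.norm() + 1e-12)      # masked SAM perturbation

w = torch.randn(10, 10, requires_grad=True)
(w ** 2).sum().backward()
eps = sparse_perturbation(w)  # add before the second forward/backward pass
```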