Distribution-Aware Feature Selection for SAEs
- URL: http://arxiv.org/abs/2508.21324v1
- Date: Fri, 29 Aug 2025 04:42:17 GMT
- Title: Distribution-Aware Feature Selection for SAEs
- Authors: Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah
- Abstract summary: TopK SAE reconstructs each token from its K most active latents. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery."
- Score: 1.2396474483677118
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an "activation lottery," where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via $L_2$ norm or entropy), forming a candidate pool of size $Kl$, and then apply Top-$K$ to select tokens across the batch from the restricted pool of features. Varying $l$ traces a spectrum between batch-level and token-specific selection. At $l=1$, tokens draw only from $K$ globally influential features, while larger $l$ expands the pool toward standard BatchTopK and more token-specific features across the batch. Small $l$ thus enforces global consistency; large $l$ favors fine-grained reconstruction. On Pythia-160M, no single value optimizes $l$ across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
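The abstract's selection rule (score feature columns, keep a pool of $Kl$ candidates, then apply batch-level Top-$K$ inside that pool) can be sketched as follows. This is an illustrative reconstruction from the abstract, not the authors' code; the $L_2$ column scoring and the batch budget of $K$ activations per token are the stated design, everything else is an assumption.

```python
import numpy as np

def sampled_sae_select(acts, k, l):
    """Illustrative Sampled-SAE selection over a batch of activations.

    acts: (batch, n_features) array of non-negative latent activations.
    Columns (features) are scored by L2 norm across the batch; the top
    k*l columns form the candidate pool, then the batch*k largest
    activations within that pool are kept (BatchTopK-style).
    """
    batch, n_feat = acts.shape
    pool_size = min(k * l, n_feat)
    # Score each feature column by its L2 norm over the batch.
    col_scores = np.linalg.norm(acts, axis=0)
    pool = np.argsort(col_scores)[-pool_size:]
    # Restrict selection to the candidate pool, then keep the
    # batch*k largest activations across all tokens in the batch.
    restricted = acts[:, pool]
    flat = restricted.ravel()
    keep = np.argsort(flat)[-(batch * k):]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[keep] = True
    out = np.zeros_like(acts)
    out[:, pool] = np.where(mask.reshape(restricted.shape), restricted, 0.0)
    return out
```

At $l=1$ the pool collapses to the $K$ globally highest-scoring features; large $l$ recovers standard BatchTopK, since the pool approaches the full feature set.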
Related papers
- Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing [52.124267908936396]
The model is composed of $M$ arms and $K$ plays. Each arm has a number of capacities, and each unit of capacity is associated with a reward function. When multiple plays compete for an arm's capacity, the capacity is allocated to plays with larger priority weights first.
arXiv Detail & Related papers (2025-12-25T11:19:09Z) - Route Experts by Sequence, not by Token [58.92918003265283]
Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token. The standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level.
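The budget shift SeqTopK describes can be sketched directly: instead of keeping the $k$ highest-scoring experts per token, keep the `seq_len * k` highest-scoring (token, expert) pairs across the whole sequence. This is a minimal sketch based only on the abstract; the routing-mask representation is an assumption.

```python
import numpy as np

def seq_topk_routing(logits, k):
    """Sequence-level Top-K routing mask (SeqTopK-style sketch).

    logits: (seq_len, n_experts) router logits for one sequence.
    Keeps the seq_len*k largest entries across the whole sequence,
    so complex tokens may receive more than k experts and simple
    tokens fewer, while the total expert budget stays fixed.
    """
    seq_len, n_experts = logits.shape
    budget = seq_len * k
    flat = logits.ravel()
    keep = np.argsort(flat)[-budget:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[keep] = True
    return mask.reshape(seq_len, n_experts)
```

Standard TopK is recovered by applying the same selection row by row instead of over the flattened sequence.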
arXiv Detail & Related papers (2025-11-09T18:36:07Z) - SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs [59.415473779171315]
We propose a novel visual token pruning strategy, Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - Foundations of Top-$k$ Decoding For Language Models [19.73575905188064]
We develop a theoretical framework that both explains and generalizes top-$k$ decoding. We show how to optimize it efficiently for a large class of divergences.
arXiv Detail & Related papers (2025-05-25T23:46:34Z) - HashAttention: Semantic Sparsity for Faster Inference [95.31739930718116]
This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. It reduces tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. On an A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
arXiv Detail & Related papers (2024-12-19T02:34:15Z) - BatchTopK Sparse Autoencoders [1.8754113193437074]
BatchTopK is a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch level. We show that BatchTopK SAEs consistently outperform TopK SAEs in reconstructing activations from GPT-2 Small and Gemma 2 2B.
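The difference between the token-level and batch-level constraints can be made concrete with a small sketch. This is an illustration of the general idea from the abstract, not the paper's implementation; the array shapes are assumptions.

```python
import numpy as np

def topk_per_token(acts, k):
    """Token-level TopK: keep the k largest latents in each row."""
    out = np.zeros_like(acts)
    idx = np.argsort(acts, axis=1)[:, -k:]
    rows = np.arange(acts.shape[0])[:, None]
    out[rows, idx] = acts[rows, idx]
    return out

def batch_topk(acts, k):
    """BatchTopK: keep the batch*k largest latents across the whole
    batch, so the per-token count varies with token informativeness."""
    budget = acts.shape[0] * k
    flat = acts.ravel()
    keep = np.argsort(flat)[-budget:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)
```

Both keep the same total number of activations per batch; only BatchTopK lets information-rich tokens claim more of the budget.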
arXiv Detail & Related papers (2024-12-09T11:39:00Z) - Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks.
We propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs.
Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels.
arXiv Detail & Related papers (2024-11-04T14:36:24Z) - Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z) - Tokenization and the Noiseless Channel [71.25796813073399]
Good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the Rényi entropy with $\alpha = 2.5$ has a very strong correlation with BLEU: $0.78$ in comparison to just $-0.32$ for compressed length.
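The quantity correlated with BLEU above is the standard Rényi entropy of order $\alpha$, $H_\alpha(p) = \frac{1}{1-\alpha}\log\sum_i p_i^\alpha$, computed over a tokenizer's unigram distribution. A minimal sketch, assuming the input is a vector of token counts or probabilities:

```python
import numpy as np

def renyi_entropy(probs, alpha=2.5):
    """Rényi entropy of order alpha (nats) for a discrete distribution.

    probs: non-negative weights (counts or probabilities); they are
    normalized internally. alpha=2.5 matches the setting reported
    in the abstract. As alpha -> 1 this approaches Shannon entropy.
    """
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()
    return np.log(np.sum(probs ** alpha)) / (1.0 - alpha)
```

For a uniform distribution over $n$ outcomes the value is $\log n$ for every $\alpha$, which is a convenient sanity check.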
arXiv Detail & Related papers (2023-06-29T10:32:09Z) - Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive or better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z) - Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{\min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ for which $a$ is a non-unique optimal action for $s$.
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.