Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- URL: http://arxiv.org/abs/2407.14435v3
- Date: Thu, 1 Aug 2024 17:42:04 GMT
- Title: Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
- Authors: Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
- Abstract summary: We introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations.
Through manual and automated interpretability studies, we show that this improvement does not come at the cost of interpretability.
- Score: 4.4110204540437365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
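To make the mechanism concrete, here is a minimal PyTorch sketch of a JumpReLU SAE trained with STEs, following the recipe described in the abstract. Dimensions, the kernel bandwidth eps, and the sparsity coefficient lam are illustrative assumptions, not the paper's settings.
```python
import torch
import torch.nn as nn

class RectangleSTE(torch.autograd.Function):
    """Heaviside gate H(z - theta): the forward pass is exact, while the
    backward pass uses a rectangle-kernel pseudo-derivative for theta."""
    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return (z > theta).float()

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        # Rectangle kernel K(u) = 1[|u| <= 1/2] with bandwidth eps.
        kernel = ((z - theta).abs() <= ctx.eps / 2).float() / ctx.eps
        grad_theta = -(kernel * grad_out)
        while grad_theta.dim() > theta.dim():  # sum out broadcast batch dims
            grad_theta = grad_theta.sum(0)
        # The step itself passes no gradient to z.
        return torch.zeros_like(z), grad_theta, None

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model=512, d_sae=4096, eps=1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.full((d_sae,), -2.0))  # theta > 0
        self.eps = eps

    def forward(self, x):
        pre = x @ self.W_enc + self.b_enc            # pre-activations
        theta = self.log_theta.exp()
        gate = RectangleSTE.apply(pre, theta, self.eps)
        feats = pre * gate                           # JumpReLU: z * H(z - theta)
        recon = feats @ self.W_dec + self.b_dec
        l0 = gate.sum(-1).mean()                     # L0 trained via the STE
        return recon, l0

sae = JumpReLUSAE()
x = torch.randn(32, 512)       # stand-in for LM activations
recon, l0 = sae(x)
lam = 1e-3                     # sparsity coefficient (assumed)
loss = (recon - x).pow(2).sum(-1).mean() + lam * l0
loss.backward()
```
Note that the L0 penalty is applied directly to the gate's output, so the STE's pseudo-gradient is the only channel through which the thresholds learn, exactly the role the abstract assigns to it.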
Related papers
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [73.09848497762667]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics.
AdaSteer steers input representations along both the Rejection Direction (RD) and the Harmfulness Direction (HD).
Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
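A rough sketch of the adaptive steering step described above. All specifics are assumed for illustration: the two direction vectors, the layer at which h is taken, and the linear rule mapping projection scores to steering strengths are not the paper's exact formulation.
```python
import torch

def adaptive_steer(h, rejection_dir, harmfulness_dir, k_rd=1.0, k_hd=1.0):
    """Shift a residual-stream activation h along two behaviour directions,
    scaling each shift by how strongly h already projects onto it."""
    rd = rejection_dir / rejection_dir.norm()
    hd = harmfulness_dir / harmfulness_dir.norm()
    # Input-dependent coefficients: a simple linear rule, assumed here.
    alpha = k_rd * (h @ rd)
    beta = k_hd * (h @ hd)
    return h + alpha * rd + beta * hd

# Usage with stand-in tensors (d_model = 512):
h = torch.randn(512)
rd = torch.randn(512)   # e.g. mean(refusal acts) - mean(compliance acts)
hd = torch.randn(512)   # e.g. mean(harmful acts) - mean(benign acts)
h_steered = adaptive_steer(h, rd, hd)
```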
arXiv Detail & Related papers (2025-04-13T07:39:17Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need [0.0]
Sparse autoencoders (SAEs) are widely used for interpreting language model activations.
Recent work introduced training SAEs directly with a combination of KL divergence and MSE.
We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens.
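A minimal sketch of such a KL+MSE objective. Obtaining logits_clean (the LM's original forward pass) and logits_spliced (the SAE reconstruction patched into the forward pass) requires model hooks, omitted here; the mixing weight beta is an assumed hyperparameter.
```python
import torch
import torch.nn.functional as F

def kl_mse_loss(x, x_hat, logits_clean, logits_spliced, beta=1.0):
    # Reconstruction error on the activations themselves.
    mse = (x_hat - x).pow(2).sum(-1).mean()
    # KL(clean || spliced): the spliced model should match the clean one.
    kl = F.kl_div(
        F.log_softmax(logits_spliced, dim=-1),
        F.log_softmax(logits_clean, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return mse + beta * kl

# Toy usage with stand-in tensors:
x = torch.randn(16, 512)
x_hat = x + 0.1 * torch.randn(16, 512)
logits = torch.randn(16, 32_000)
loss = kl_mse_loss(x, x_hat, logits, logits + 0.05 * torch.randn_like(logits))
```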
arXiv Detail & Related papers (2025-03-21T16:15:49Z)
- Low-Rank Adapting Models for Sparse Autoencoders [6.932760557251821]
We use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE.
We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs.
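The core idea admits a short sketch: the SAE stays frozen in the forward pass while low-rank adapters on the model's weights are trained around it. The toy LoRA layer below is illustrative; in practice one would wrap the LM's projection matrices (e.g. via the peft library) and minimise KL to the original model's outputs.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Frozen (toy) SAE spliced into the forward pass:
sae = nn.Sequential(nn.Linear(512, 4096), nn.ReLU(), nn.Linear(4096, 512))
for p in sae.parameters():
    p.requires_grad_(False)

block = LoRALinear(nn.Linear(512, 512))
x = torch.randn(8, 512)
out = block(sae(x))       # only the LoRA factors A, B receive gradients
```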
arXiv Detail & Related papers (2025-01-31T18:59:16Z)
- SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks.
Existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks.
We propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal.
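A hedged sketch of the magnitude/direction decoupling: each task contributes a low-rank direction that is frozen when the task ends, while scalar magnitudes remain trainable throughout. Shapes, initialisation, and the Frobenius normalisation are assumptions of this sketch.
```python
import torch
import torch.nn as nn

class SDLoRALinear(nn.Module):
    """Effective weight: W0 + sum_t m_t * (B_t A_t) / ||B_t A_t||_F."""
    def __init__(self, d_in=512, d_out=512, rank=8):
        super().__init__()
        self.register_buffer("W0", torch.randn(d_out, d_in) * 0.01)
        self.As, self.Bs, self.ms = (nn.ParameterList() for _ in range(3))
        self.d_in, self.d_out, self.rank = d_in, d_out, rank

    def new_task(self):
        # Small random init (rather than zeros) keeps the normalisation
        # well-defined; this detail is an assumption of the sketch.
        self.As.append(nn.Parameter(torch.randn(self.rank, self.d_in) * 0.01))
        self.Bs.append(nn.Parameter(torch.randn(self.d_out, self.rank) * 0.01))
        self.ms.append(nn.Parameter(torch.ones(1)))

    def end_task(self):
        self.As[-1].requires_grad_(False)   # freeze the direction only
        self.Bs[-1].requires_grad_(False)

    def forward(self, x):
        W = self.W0
        for A, B, m in zip(self.As, self.Bs, self.ms):
            delta = B @ A
            W = W + m * delta / delta.norm()
        return x @ W.T

layer = SDLoRALinear()
layer.new_task()                      # task 1: learn A_1, B_1, m_1
_ = layer(torch.randn(4, 512))
layer.end_task()                      # freeze direction 1; m_1 stays free
layer.new_task()                      # task 2 adds a fresh direction
```
Because only scalar magnitudes and one fresh low-rank pair are active per task, the method needs neither a growing prompt/LoRA pool nor a rehearsal buffer.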
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
- An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z)
- SALSA: Speedy ASR-LLM Synchronous Aggregation [40.91241351045586]
We propose SALSA, which couples the decoder layers of the ASR model to the LLM decoder, while synchronously advancing both decoders.
We evaluate SALSA on 8 low-resource languages in the FLEURS benchmark, yielding substantial WER reductions of up to 38%.
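A heavily simplified sketch of the coupling: the ASR decoder's hidden state is projected and fused into the LLM decoder's state at each synchronous step. The additive fusion and the dimensions are assumptions of this sketch, not SALSA's exact design.
```python
import torch
import torch.nn as nn

d_asr, d_llm = 256, 512
proj = nn.Linear(d_asr, d_llm)   # learned coupling projection (assumed)

def coupled_step(asr_hidden, llm_hidden):
    """One synchronous decoding step: fuse the ASR state into the LLM state."""
    return llm_hidden + proj(asr_hidden)

fused = coupled_step(torch.randn(1, d_asr), torch.randn(1, d_llm))
```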
arXiv Detail & Related papers (2024-08-29T14:00:57Z)
- Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
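As a generic illustration of direct parameter editing (not necessarily the paper's exact procedure), one can take a behaviour direction v in activation space, e.g. from a linear probe separating toxic from non-toxic states, and project it out of a single weight matrix that writes into the residual stream.
```python
import torch

@torch.no_grad()
def edit_weight(W, v, strength=1.0):
    """Dampen the component of W's output space along direction v.
    W: (d_out, d_in) weight writing into the residual stream; v: (d_out,)."""
    v = v / v.norm()
    return W - strength * torch.outer(v, v @ W)

W = torch.randn(512, 2048)   # stand-in for an MLP down-projection
v = torch.randn(512)         # behaviour direction from a probe (assumed)
W_edited = edit_weight(W, v)
```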
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
- FIRST: Faster Improved Listwise Reranking with Single Token Decoding [56.727761901751194]
We introduce FIRST, a novel listwise LLM reranking approach that leverages the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates.
Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark.
Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.
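The single-token trick reduces reranking to one forward pass: each candidate is labelled with an identifier token, and candidates are sorted by the first decoding position's logits over those labels instead of autoregressively generating a full permutation. A minimal sketch, with token ids assumed for illustration:
```python
import torch

def rank_from_first_token_logits(first_token_logits, label_token_ids):
    """first_token_logits: (vocab,) logits at the first decoding position.
    label_token_ids: token id of each candidate's identifier, in input order.
    Returns candidate indices sorted from most to least relevant."""
    label_logits = first_token_logits[torch.tensor(label_token_ids)]
    return torch.argsort(label_logits, descending=True).tolist()

# Toy usage: 4 candidates whose labels map to (assumed) token ids.
vocab = 50_000
logits = torch.randn(vocab)
label_ids = [317, 347, 356, 360]   # e.g. ids of " A", " B", " C", " D"
print(rank_from_first_token_logits(logits, label_ids))
```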
arXiv Detail & Related papers (2024-06-21T21:27:50Z)
- Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations.
In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage.
In training SAEs on LMs of up to 7B parameters, Gated SAEs resolve shrinkage and require half as many firing features to achieve comparable reconstruction fidelity.
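A minimal sketch of the Gated SAE forward pass: one path decides which features fire (a hard gate) and a separate, per-feature rescaled path estimates how strongly, decoupling feature detection from the L1-penalised magnitudes. Initialisation and the training losses are simplified here.
```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    def __init__(self, d_model=512, d_sae=4096):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))  # W_mag = exp(r) * W_gate
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_c = x - self.b_dec
        pi_gate = x_c @ self.W_gate + self.b_gate
        gate = (pi_gate > 0).float()                   # which features fire
        mag = torch.relu(x_c @ (self.W_gate * self.r_mag.exp()) + self.b_mag)
        feats = gate * mag                             # gated magnitudes
        return feats @ self.W_dec + self.b_dec, feats, pi_gate

sae = GatedSAE()
recon, feats, pi_gate = sae(torch.randn(16, 512))
# Training (not shown) puts the L1 penalty on relu(pi_gate) and adds an
# auxiliary reconstruction term so the gate path still receives gradients.
```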
arXiv Detail & Related papers (2024-04-24T17:47:22Z)
- Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition [54.9235160379917]
Stable Distillation is a simple and novel approach for continued pre-training based on self-supervised learning (SSL).
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
arXiv Detail & Related papers (2023-12-20T06:02:12Z)
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
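A sketch of the mutual-learning component only: each model is trained on its own transcripts while also matching the other's predictive distribution via a symmetric KL term. The large-margin contrastive loss is omitted, and the weighting gamma is an assumed hyperparameter.
```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_manual, logits_asr, labels, gamma=0.5):
    # Supervised loss for each model on its own input view.
    ce = F.cross_entropy(logits_manual, labels) + F.cross_entropy(logits_asr, labels)
    p_m = F.log_softmax(logits_manual, dim=-1)
    p_a = F.log_softmax(logits_asr, dim=-1)
    # Symmetric KL so each model distils from the other ("mutual learning").
    kl = F.kl_div(p_a, p_m, log_target=True, reduction="batchmean") \
       + F.kl_div(p_m, p_a, log_target=True, reduction="batchmean")
    return ce + gamma * kl

logits_m, logits_a = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = mutual_learning_loss(logits_m, logits_a, labels)
```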
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
- Does Zero-Shot Reinforcement Learning Exist? [11.741744003560095]
A zero-shot RL agent is an agent that can solve any RL task instantly with no additional planning or learning.
This marks a shift from the reward-centric RL paradigm towards "controllable" agents.
Strategies for approximate zero-shot RL have been suggested using successor features (SFs) or forward-backward (FB) representations.
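A hedged sketch of zero-shot inference in the FB framework: pretrained networks F(s, a, z) and B(s) (random stand-ins below) map a newly specified reward to a task vector z = E[B(s) r(s)], after which actions are chosen greedily against F(s, a, z) · z with no further planning or learning. Dimensions and the discrete action set are illustrative assumptions.
```python
import torch
import torch.nn as nn

d_state, d_z, n_actions = 8, 16, 4
B = nn.Linear(d_state, d_z)                         # stand-in backward net
F_net = nn.Linear(d_state + n_actions + d_z, d_z)   # stand-in forward net

def infer_task_vector(states, rewards):
    """z = E_{s ~ rho}[B(s) r(s)], estimated from a reward-labelled sample."""
    return (B(states) * rewards.unsqueeze(-1)).mean(0)

def act_greedy(state, z):
    """argmax_a F(s, a, z) . z over a small discrete action set."""
    q = []
    for a in range(n_actions):
        a_onehot = torch.zeros(n_actions)
        a_onehot[a] = 1.0
        q.append(F_net(torch.cat([state, a_onehot, z])) @ z)
    return int(torch.stack(q).argmax())

states = torch.randn(128, d_state)    # states sampled from rho
rewards = torch.randn(128)            # the new task's rewards on them
z = infer_task_vector(states, rewards)
action = act_greedy(torch.randn(d_state), z)
```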
arXiv Detail & Related papers (2022-09-29T16:54:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.