Improving Dictionary Learning with Gated Sparse Autoencoders
- URL: http://arxiv.org/abs/2404.16014v2
- Date: Tue, 30 Apr 2024 17:54:04 GMT
- Title: Improving Dictionary Learning with Gated Sparse Autoencoders
- Authors: Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda,
- Abstract summary: Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations.
In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage.
In training SAEs on LMs of up to 7B parameters, Gated SAEs solve shrinkage, and require half as many firing features to achieve comparable reconstruction fidelity.
- Score: 8.3037652157611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
Related papers
- PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead [24.611413814466978]
Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search.
Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference.
We propose Position-Embedding-Agnostic attention Re-weighting (PEAR) which enhances the context awareness of LLMs with zero inference overhead.
arXiv Detail & Related papers (2024-09-29T15:40:54Z) - Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders [4.4110204540437365]
We introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations.
We show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies.
arXiv Detail & Related papers (2024-07-19T16:07:19Z) - Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
arXiv Detail & Related papers (2024-07-11T17:52:03Z) - Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [61.916827858666906]
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
arXiv Detail & Related papers (2024-03-01T03:29:54Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered as a means to improve the trustworthiness of text generation from large language models.
In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - Self-RAG: Learning to Retrieve, Generate, and Critique through
Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG)
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z) - Meta-Learning Adversarial Bandit Algorithms [55.72892209124227]
We study online meta-learning with bandit feedback.
We learn to tune online mirror descent generalization (OMD) with self-concordant barrier regularizers.
arXiv Detail & Related papers (2023-07-05T13:52:10Z) - Learning Common Rationale to Improve Self-Supervised Representation for
Fine-Grained Visual Recognition Problems [61.11799513362704]
We propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes.
We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective.
arXiv Detail & Related papers (2023-03-03T02:07:40Z) - Meta-Learning Adversarial Bandits [49.094361442409785]
We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure.
As the first to target the adversarial setting, we design a meta-algorithm that setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit optimization (BLO)
Our guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-B sequence.
arXiv Detail & Related papers (2022-05-27T17:40:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.