Interpreting Attention Layer Outputs with Sparse Autoencoders
- URL: http://arxiv.org/abs/2406.17759v1
- Date: Tue, 25 Jun 2024 17:43:13 GMT
- Title: Interpreting Attention Layer Outputs with Sparse Autoencoders
- Authors: Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
- Abstract summary: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work.
- Score: 3.201633659481912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and up to 2B parameters. We perform a qualitative study of the features computed by attention layers, and find multiple families: long-range context, short-range context and induction features. We qualitatively study the role of every head in GPT-2 Small, and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix whereas others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al.), validating that the SAEs find causally meaningful intermediate variables, and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs.
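The abstract above describes training SAEs to decompose attention layer outputs into sparse, interpretable features. A minimal sketch of the SAE forward pass on a toy activation vector is shown below; all dimensions, weights, and names here are illustrative assumptions, not the paper's trained models.

```python
import random

# Minimal sketch of a sparse autoencoder (SAE) forward pass on an
# attention-output activation vector. Dimensions and weights are toy
# values; the paper trains SAEs on real transformer activations.

D_MODEL = 8   # dimension of the attention layer output (toy size)
D_SAE = 32    # SAE latent dimension (overcomplete: D_SAE > D_MODEL)

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_enc = rand_matrix(D_SAE, D_MODEL)   # encoder weights
b_enc = [0.0] * D_SAE                 # encoder bias
W_dec = rand_matrix(D_MODEL, D_SAE)   # decoder weights
b_dec = [0.0] * D_MODEL               # decoder bias

def relu(x):
    return x if x > 0.0 else 0.0

def encode(activation):
    # f = ReLU(W_enc @ a + b_enc): sparse, non-negative feature activations
    return [relu(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(features):
    # a_hat = W_dec @ f + b_dec: reconstruction of the original activation
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(W_dec, b_dec)]

activation = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
features = encode(activation)
reconstruction = decode(features)

# Training would minimize ||a - a_hat||^2 + lambda * sum(f) (an L1
# sparsity penalty), pushing each latent toward being an interpretable
# "feature" of the model's computation.
mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / D_MODEL
print(f"active features: {sum(f > 0 for f in features)}/{D_SAE}")
```

The ReLU plus L1 penalty is what makes the learned decomposition sparse: only a small fraction of the overcomplete latent dimensions fire on any given activation.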
Related papers
- Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models [51.53835083483751]
We investigate how large language models perform latent multi-hop reasoning in prompts like "Wolfgang Amadeus Mozart's mother's spouse is"
We find that failures often stem from the relation attribute extraction stage, where conflicting logits reduce prediction accuracy.
We propose back attention, a novel mechanism that enables lower layers to leverage higher-layer hidden states from different positions during attention computation.
arXiv Detail & Related papers (2025-02-15T15:36:42Z) - Transformers Use Causal World Models in Maze-Solving Tasks [49.67445252528868]
We investigate the inner workings of transformer models trained on tasks across various domains.
We find that transformers are able to reason with respect to a greater number of active features than they see during training.
We observe that varying positional encodings can alter how world models (WMs) are encoded in a model's residual stream.
arXiv Detail & Related papers (2024-12-16T15:21:04Z) - Understanding Hidden Computations in Chain-of-Thought Reasoning [0.0]
Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning abilities of large language models.
Recent studies have shown that models can still perform complex reasoning tasks even when the CoT is replaced with filler (hidden) characters.
arXiv Detail & Related papers (2024-12-05T18:43:11Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models [32.2976613483151]
We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task.
We find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers.
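The low-rank subspace communication described above can be sketched as a toy write/read channel: an earlier layer adds a scalar signal along a fixed direction in the residual stream, and a later layer recovers it with a projection. The direction and dimensions below are illustrative assumptions, not weights from the studied models.

```python
# Toy sketch of inter-layer communication through a rank-1 subspace
# of the residual stream. The chosen direction is illustrative.

D_MODEL = 6

# Rank-1 subspace spanned by a unit vector.
direction = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

def write_signal(residual, signal):
    # Earlier layer: add signal * direction into the residual stream.
    return [r + signal * d for r, d in zip(residual, direction)]

def read_signal(residual):
    # Later layer: project the residual stream onto the same direction.
    return sum(r * d for r, d in zip(residual, direction))

residual = [0.0] * D_MODEL
residual = write_signal(residual, 3.5)
recovered = read_signal(residual)
print(recovered)  # 3.5
```

Because the write and read use the same direction, the signal survives other (orthogonal) edits to the residual stream, which is what makes such subspaces a usable communication channel between layers.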
arXiv Detail & Related papers (2024-06-13T18:12:01Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Function Vectors in Large Language Models [45.267194267587435]
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs)
Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV)
arXiv Detail & Related papers (2023-10-23T17:55:24Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.