Interpreting Attention Layer Outputs with Sparse Autoencoders
- URL: http://arxiv.org/abs/2406.17759v1
- Date: Tue, 25 Jun 2024 17:43:13 GMT
- Title: Interpreting Attention Layer Outputs with Sparse Autoencoders
- Authors: Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
- Abstract summary: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work.
- Score: 3.201633659481912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and up to 2B parameters. We perform a qualitative study of the features computed by attention layers, and find multiple families: long-range context, short-range context and induction features. We qualitatively study the role of every head in GPT-2 Small, and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix whereas others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al.), validating that the SAEs find causally meaningful intermediate variables, and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs.
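To make the core technique concrete, below is a minimal sketch of a sparse autoencoder trained on attention layer outputs, in the spirit of the paper. It is not the authors' released implementation: the class name, dimensions, L1 coefficient, and the placeholder batch of cached activations are all illustrative assumptions.

```python
# Illustrative sketch of an SAE on attention-layer outputs (not the paper's exact code).
import torch
import torch.nn as nn


class AttentionOutputSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # Encoder maps attention outputs into a wider, sparsely activating feature space.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        # Decoder reconstructs the original activation from the active features.
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # x: [batch, d_model] cached attention-layer output activations.
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats


def sae_loss(x, recon, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    mse = (recon - x).pow(2).mean()
    l1 = feats.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1


# Usage sketch: in practice `attn_acts` would be activations cached from a model hook.
sae = AttentionOutputSAE(d_model=768, d_sae=768 * 16)
attn_acts = torch.randn(32, 768)  # placeholder for real cached activations
recon, feats = sae(attn_acts)
loss = sae_loss(attn_acts, recon, feats)
loss.backward()
```

The sparse, nonnegative `feats` vector is what the paper interprets: each feature direction in the decoder corresponds to a candidate interpretable component of the attention layer's output.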
Related papers
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Transcoders Find Interpretable LLM Feature Circuits [1.4254279830438588]
We introduce a novel method for using transcoders to perform circuit analysis through sublayers.
We train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability.
arXiv Detail & Related papers (2024-06-17T17:49:00Z) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models [32.2976613483151]
We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task.
We find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers.
arXiv Detail & Related papers (2024-06-13T18:12:01Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Function Vectors in Large Language Models [45.267194267587435]
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs).
Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV).
arXiv Detail & Related papers (2023-10-23T17:55:24Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small [68.879023473838]
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model.
arXiv Detail & Related papers (2022-11-01T17:08:44Z) - Auto-Parsing Network for Image Captioning and Visual Question Answering [101.77688388554097]
We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures.
Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption.
arXiv Detail & Related papers (2021-08-24T08:14:35Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)