Residual Stream Analysis with Multi-Layer SAEs
- URL: http://arxiv.org/abs/2409.04185v3
- Date: Mon, 24 Feb 2025 09:18:36 GMT
- Title: Residual Stream Analysis with Multi-Layer SAEs
- Authors: Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
- Abstract summary: We introduce the multi-layer SAE (MLSAE), a single SAE trained on the residual stream activation vectors from every transformer layer. We find that individual latents are often active at a single layer for a given token or prompt, but the layer at which an individual latent is active may differ for different tokens or prompts. Our results represent a new approach to understanding how representations change as they flow through transformers.
- Score: 21.142967037533175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer. Given that the residual stream is understood to preserve information across layers, we expected MLSAE latents to 'switch on' at a token position and remain active at later layers. Interestingly, we find that individual latents are often active at a single layer for a given token or prompt, but the layer at which an individual latent is active may differ for different tokens or prompts. We quantify these phenomena by defining a distribution over layers and considering its variance. We find that the variance of the distributions of latent activations over layers is about two orders of magnitude greater when aggregating over tokens compared with a single token. For larger underlying models, the degree to which latents are active at multiple layers increases, which is consistent with the fact that the residual stream activation vectors at adjacent layers become more similar. Finally, we relax the assumption that the residual stream basis is the same at every layer by applying pre-trained tuned-lens transformations, but our findings remain qualitatively similar. Our results represent a new approach to understanding how representations change as they flow through transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.
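The abstract describes two concrete steps: training a single SAE on residual-stream vectors drawn from every layer of the underlying model, and measuring the variance of each latent's activation distribution over layers. The PyTorch sketch below illustrates both under simple assumptions (a linear encoder/decoder with ReLU latents and an L1 sparsity penalty); the class names, loss, and normalisation are illustrative and are not taken from the released mlsae code.

```python
# Minimal sketch of a multi-layer SAE (MLSAE): one sparse autoencoder shared
# across residual-stream activations from every transformer layer.
# Names, hyperparameters, and the loss are illustrative assumptions.
import torch
import torch.nn as nn


class MLSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        # x: residual-stream activations of shape (..., d_model),
        # sampled from any layer of the underlying transformer.
        z = torch.relu(self.encoder(x))  # sparse latent activations
        return self.decoder(z), z


def sae_loss(model: MLSAE, x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 sparsity penalty, applied to a batch
    # that mixes activations from all layers so one set of weights is shared.
    x_hat, z = model(x)
    return ((x_hat - x) ** 2).mean() + l1_coeff * z.abs().mean()


def layer_variance(z_per_layer: torch.Tensor) -> torch.Tensor:
    # z_per_layer: activation of each latent at each layer for one token
    # (or aggregated over many tokens), shape (n_layers, n_latents).
    # Normalise over layers to get a distribution per latent, then return
    # the variance of the layer index under that distribution.
    n_layers, _ = z_per_layer.shape
    layers = torch.arange(n_layers, dtype=z_per_layer.dtype).unsqueeze(1)
    p = z_per_layer / z_per_layer.sum(dim=0, keepdim=True).clamp_min(1e-8)
    mean = (p * layers).sum(dim=0)
    return (p * (layers - mean) ** 2).sum(dim=0)
```

Comparing layer_variance for a single token with the same quantity computed from activations aggregated over many tokens corresponds to the comparison the abstract quantifies, where the aggregated variance is roughly two orders of magnitude larger.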
Related papers
- Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws [42.926341529639274]
Transformers deliver outstanding performance across a wide range of tasks. Their task-solving performance is improved by increasing parameter size. This study focuses on layers and their size, which largely determine the parameter size of Transformers.
arXiv Detail & Related papers (2025-05-29T21:13:31Z)
- Intermediate Layer Classifiers for OOD generalization [17.13749013546228]
In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation.
We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer.
Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer.
arXiv Detail & Related papers (2025-04-07T19:50:50Z)
- Adaptive Layer-skipping in Pre-trained LLMs [27.938188248731038]
FlexiDepth is a method that dynamically adjusts the number of Transformer layers used in text generation.
By incorporating a plug-in router and adapter, FlexiDepth enables adaptive layer-skipping in large language models.
arXiv Detail & Related papers (2025-03-31T07:20:58Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video).
We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- Value Residual Learning For Alleviating Attention Concentration In Transformers [14.898656879574622]
Stacking multiple attention layers leads to attention concentration.
One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers.
We propose Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers.
arXiv Detail & Related papers (2024-10-23T14:15:07Z)
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z)
- Manifold-Preserving Transformers are Effective for Short-Long Range Encoding [39.14128923434994]
Multi-head self-attention-based Transformers have shown promise in different learning tasks.
We propose TransJect, an encoder model that guarantees a theoretical bound for layer-wise distance preservation between a pair of tokens.
arXiv Detail & Related papers (2023-10-22T06:58:28Z)
- JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention [36.737750120893516]
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformers.
JoMA predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations.
We leverage JoMA to explain how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model.
arXiv Detail & Related papers (2023-10-01T01:21:35Z)
- Learnable Polyphase Sampling for Shift Invariant and Equivariant Convolutional Networks [120.78155051439076]
LPS can be trained end-to-end from data and generalizes existing handcrafted downsampling layers.
We evaluate LPS on image classification and semantic segmentation.
arXiv Detail & Related papers (2022-10-14T17:59:55Z)
- Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance.
We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z)
- Transformer Feed-Forward Layers Are Key-Value Memories [49.52087581977751]
We show that feed-forward layers in transformer-based language models operate as key-value memories.
We show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones.
arXiv Detail & Related papers (2020-12-29T19:12:05Z)
- Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z)