Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
- URL: http://arxiv.org/abs/2504.20938v1
- Date: Tue, 29 Apr 2025 17:03:03 GMT
- Title: Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
- Authors: Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu
- Abstract summary: We propose Low-Rank Sparse Attention (Lorsa) to disentangle original Multi Head Self Attention (MHSA) into individually comprehensible components. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors. Lorsa and Sparse Autoencoder (SAE) are both sparse dictionary learning methods applied to different Transformer components.
- Score: 60.95047500466925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoder (SAE) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.
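For intuition, here is a minimal, hypothetical sketch of the idea described in the abstract: a large bank of very low-rank attention heads trained as a sparse replacement for an MHSA layer, with only a few heads active at each position. The head count, head width, the top-k-by-output-norm sparsification, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative Lorsa-style layer: many low-rank attention heads whose outputs are
# sparsified (top-k over heads per position) and summed. All sizes and the
# sparsification rule are assumptions for the sketch, not the paper's code.
import torch
import torch.nn as nn


class LorsaSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 256, d_head: int = 1, k_active: int = 16):
        super().__init__()
        self.d_head, self.k_active = d_head, k_active
        scale = d_model ** -0.5
        self.W_Q = nn.Parameter(torch.randn(n_heads, d_model, d_head) * scale)
        self.W_K = nn.Parameter(torch.randn(n_heads, d_model, d_head) * scale)
        self.W_V = nn.Parameter(torch.randn(n_heads, d_model, d_head) * scale)
        self.W_O = nn.Parameter(torch.randn(n_heads, d_head, d_model) * scale)

    def forward(self, x):                                   # x: [batch, seq, d_model]
        q = torch.einsum("bsd,hde->bhse", x, self.W_Q)
        k = torch.einsum("bsd,hde->bhse", x, self.W_K)
        v = torch.einsum("bsd,hde->bhse", x, self.W_V)
        scores = torch.einsum("bhqe,bhke->bhqk", q, k) / self.d_head ** 0.5
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        z = torch.einsum("bhqk,bhke->bhqe", scores.softmax(-1), v)   # per-head read-out
        head_out = torch.einsum("bhse,hed->bhsd", z, self.W_O)       # per-head contribution
        # Sparsity: keep only the top-k heads (by output norm) at each position.
        norms = head_out.norm(dim=-1)                                # [batch, heads, seq]
        topk = norms.topk(self.k_active, dim=1).indices
        keep = torch.zeros_like(norms, dtype=torch.bool).scatter_(1, topk, True)
        return (head_out * keep.unsqueeze(-1)).sum(dim=1)            # [batch, seq, d_model]


lorsa = LorsaSketch(d_model=64)
y = lorsa(torch.randn(2, 10, 64))   # in training, y would be fit to the MHSA layer's output
```

Consistent with the abstract's framing of Lorsa as sparse dictionary learning, such a module would be trained to reconstruct the original attention layer's output, analogous to how SAEs reconstruct activations of other components.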
Related papers
- On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. We quantify how tokens interact with contextual information based on their sequential positions. Our framework offers a principled foundation for understanding positional biases in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z)
- Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi [0.0]
Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers. Vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism.
arXiv Detail & Related papers (2025-01-22T14:19:48Z)
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention head, which we name Positional Heads, across various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
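As background for why the dimension index matters here, the short calculation below prints the rotation wavelength of sampled RoPE dimension pairs under a standard parameterization (head dimension 128, base 10000): higher dimension pairs rotate more slowly, so they can encode longer token distances. The constants are illustrative assumptions, not values taken from the paper.

```python
# Each RoPE dimension pair (2i, 2i+1) rotates with frequency theta_i = base**(-2i/d),
# so its wavelength 2*pi/theta_i (positions per full rotation) grows with i.
import math

d, base = 128, 10000.0
for i in range(0, d // 2, 16):                 # sample every 16th dimension pair
    theta = base ** (-2 * i / d)
    wavelength = 2 * math.pi / theta
    print(f"dim pair {i:3d}: wavelength ~ {wavelength:12.1f} tokens")
```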
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Interpreting Attention Layer Outputs with Sparse Autoencoders [3.201633659481912]
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enable researchers to explain model behavior in greater detail than prior work.
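A minimal sketch of this setup, under assumed sizes: a sparse autoencoder with an overcomplete ReLU encoder and a linear decoder, trained with a reconstruction loss plus an L1 sparsity penalty on cached attention-layer outputs. Class and variable names, dictionary size, and the loss weighting are hypothetical.

```python
# Sparse autoencoder on attention-layer outputs: overcomplete ReLU encoder,
# linear decoder, MSE reconstruction + L1 sparsity. Sizes are illustrative.
import torch
import torch.nn as nn


class AttentionOutputSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, attn_out):                        # attn_out: [n_tokens, d_model]
        features = torch.relu(self.encoder(attn_out))   # sparse feature activations
        return self.decoder(features), features


sae = AttentionOutputSAE(d_model=768, d_dict=768 * 32)
attn_out = torch.randn(1024, 768)                       # stand-in for cached activations
recon, feats = sae(attn_out)
loss = ((recon - attn_out) ** 2).mean() + 1e-3 * feats.abs().mean()
```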
arXiv Detail & Related papers (2024-06-25T17:43:13Z)
- Memorization Capacity of Multi-Head Attention in Transformers [41.63663596609437]
This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize.
Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data.
Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property.
arXiv Detail & Related papers (2023-06-03T05:45:29Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
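A simplified sketch of the search/retrieval disentanglement described above: query-key "searches" and value "retrievals" are parameterized independently, and each search softly selects which retrieval to read from. The selection mechanism, shapes, and names below are simplifying assumptions, not the paper's exact formulation.

```python
# Searches (attention patterns) and retrievals (value read-outs) are decoupled;
# a second softmax lets every search pick among retrievals.
import torch
import torch.nn as nn


class CompositionalAttentionSketch(nn.Module):
    def __init__(self, d_model, n_search=8, n_retrieval=4, d_head=64):
        super().__init__()
        self.ns, self.nr, self.dh = n_search, n_retrieval, d_head
        self.q = nn.Linear(d_model, n_search * d_head)
        self.k = nn.Linear(d_model, n_search * d_head)
        self.v = nn.Linear(d_model, n_retrieval * d_head)
        self.select = nn.Linear(d_head, n_retrieval)      # search state -> retrieval scores
        self.out = nn.Linear(n_search * d_head, d_model)

    def forward(self, x):                                 # x: [batch, seq, d_model]
        b, s, _ = x.shape
        q = self.q(x).view(b, s, self.ns, self.dh).transpose(1, 2)   # [b, ns, s, dh]
        k = self.k(x).view(b, s, self.ns, self.dh).transpose(1, 2)
        v = self.v(x).view(b, s, self.nr, self.dh).transpose(1, 2)   # [b, nr, s, dh]
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.dh ** 0.5, dim=-1)
        # Every search pattern is applied to every retrieval's values.
        ctx = torch.einsum("bnqk,brkd->bnrqd", attn, v)               # [b, ns, nr, s, dh]
        # Each search position softly picks one retrieval to read from.
        sel = torch.softmax(self.select(q), dim=-1)                   # [b, ns, s, nr]
        mixed = torch.einsum("bnqr,bnrqd->bnqd", sel, ctx)            # [b, ns, s, dh]
        return self.out(mixed.transpose(1, 2).reshape(b, s, -1))


y = CompositionalAttentionSketch(d_model=128)(torch.randn(2, 10, 128))
```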
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
- AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection [8.39479809973967]
Few-shot object detection (FSOD) aims at learning a detector that can fast adapt to previously unseen objects with scarce examples.
Existing methods solve this problem by performing subtasks of classification and localization utilizing a shared component.
We argue that a general few-shot detector should explicitly decompose these two subtasks and leverage information from both to enhance feature representations.
arXiv Detail & Related papers (2020-11-30T10:21:32Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
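A conceptual sketch of the hard-retrieval idea: at decoding time each query reads only the value at its single highest-scoring key position, replacing the softmax-weighted sum with an index lookup. The function below is an illustration of that mechanism under assumed shapes; it omits the paper's training procedure.

```python
# Hard retrieval: argmax over keys, then gather the selected value directly.
import torch


def hard_retrieval_attention(q, k, v):
    # q: [batch, heads, q_len, d];  k, v: [batch, heads, k_len, d]
    scores = q @ k.transpose(-1, -2)                       # [batch, heads, q_len, k_len]
    best = scores.argmax(dim=-1)                           # one key index per query
    idx = best.unsqueeze(-1).expand(*best.shape, v.size(-1))
    return torch.gather(v, dim=-2, index=idx)              # that key's value, no weighted sum


q = torch.randn(2, 8, 5, 64)
k = torch.randn(2, 8, 12, 64)
v = torch.randn(2, 8, 12, 64)
out = hard_retrieval_attention(q, k, v)                    # [2, 8, 5, 64]
```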
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
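A small sketch of what fixed, non-learnable attentive patterns can look like: hard-coded attention matrices such as "attend to the previous, next, or current token". The specific pattern set used in the paper may differ; the helper below is illustrative only.

```python
# Fixed attentive patterns: one-hot attention matrices with a constant positional offset.
import torch


def fixed_pattern(seq_len: int, offset: int) -> torch.Tensor:
    """Attention matrix where position i attends to position i + offset (clamped to bounds)."""
    targets = (torch.arange(seq_len) + offset).clamp(0, seq_len - 1)
    pattern = torch.zeros(seq_len, seq_len)
    pattern[torch.arange(seq_len), targets] = 1.0
    return pattern                                  # rows sum to 1, like a softmax output


prev_tok = fixed_pattern(6, offset=-1)              # each token attends to its predecessor
next_tok = fixed_pattern(6, offset=+1)              # ... to its successor
current = fixed_pattern(6, offset=0)                # ... to itself
# Applying a fixed pattern to the value vectors replaces the learned QK^T softmax:
values = torch.randn(6, 64)
out = prev_tok @ values                             # [6, 64]
```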
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.