When Attention Sink Emerges in Language Models: An Empirical View
- URL: http://arxiv.org/abs/2410.10781v1
- Date: Mon, 14 Oct 2024 17:50:28 GMT
- Title: When Attention Sink Emerges in Language Models: An Empirical View
- Authors: Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin,
- Abstract summary: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important.
This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others.
We first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models.
- Score: 39.36282162213973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens' inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.
Related papers
- Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z) - Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z) - Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization [97.84156490765457]
Large language models (LLMs) struggle to capture relevant information located in the middle of their input.
This phenomenon has been known as the lost-in-the-middle problem.
We show found-in-the-middle achieves better performance in locating relevant information within a long context.
arXiv Detail & Related papers (2024-06-23T04:35:42Z) - Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs)
We propose a training-free Attention Technique (ACT) that automatically optimize the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z) - Simple linear attention language models balance the recall-throughput
tradeoff [40.08746299497935]
We propose BASED, a simple architecture combining linear and sliding window attention.
We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points.
arXiv Detail & Related papers (2024-02-28T19:28:27Z) - Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z) - Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z) - Revisiting Attention Weights as Explanations from an Information
Theoretic Perspective [4.499369811647602]
We show that attention mechanisms have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
Our findings indicate that attention mechanisms do have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z) - Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship between inputs and co-indexed intermediate representations exists -- under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
arXiv Detail & Related papers (2021-06-02T11:42:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.