Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
- URL: http://arxiv.org/abs/2502.09647v1
- Date: Tue, 11 Feb 2025 00:04:32 GMT
- Title: Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification
- Authors: Konstantin Donhauser, Charles Arnal, Mohammad Pezeshki, Vivien Cabannes, David Lopez-Paz, Kartik Ahuja
- Abstract summary: We show that attention heads swing between attending to local and long-context information depending on the query.
We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys.
- Score: 20.49185921960757
- License:
- Abstract: The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend only to local information, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it is possible to predict which heads are crucial for long-context processing using only local keys. The core idea is to exploit a simple model for the long-context scores via second-moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.
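The abstract does not spell out an implementation, but the core idea lends itself to a short illustration. The sketch below is one plausible reading, assuming a standard softmax attention head: the logits a query would assign to far-away keys are modelled as Gaussian, with mean and variance (the second moments) estimated from the local keys alone, and the head is flagged as long-context when the predicted attention mass on far keys is non-negligible. The function name, the moment estimates, and the threshold are illustrative assumptions, not the authors' method.

```python
import torch

def needs_long_context(q, k_local, n_far, mass_threshold=0.1):
    """Hypothetical sketch: flag a head as long-context using only local keys.

    q:        (d,)          current query vector for this head
    k_local:  (n_local, d)  keys inside the local window
    n_far:    int           number of keys outside the local window
    """
    d = q.shape[-1]
    s_local = (k_local @ q) / d**0.5                 # local attention logits

    # Assumption: far logits follow N(mu, sigma^2), with the moments
    # borrowed from the local logits (not necessarily the paper's model).
    mu = s_local.mean()
    sigma2 = s_local.var(unbiased=False)

    # For a Gaussian logit, E[exp(s)] = exp(mu + sigma^2 / 2), so the expected
    # far-key contribution to the softmax denominator can be estimated without
    # touching the far keys.  Subtract the max local logit for stability.
    m = s_local.max()
    far_mass = n_far * torch.exp(mu - m + 0.5 * sigma2)
    local_mass = torch.exp(s_local - m).sum()

    # The head "needs" long context if far keys would carry a noticeable
    # fraction of the total attention mass.
    return (far_mass / (far_mass + local_mass)) > mass_threshold
```

In an inference setting, such a test could be run per head and per query, so that only the flagged heads fetch long-context keys (for example from an offloaded KV cache); this is where the efficiency gains mentioned in the abstract would come from.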
Related papers
- Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high accumulated attention scores, our approach flexibly chooses the tokens that are relevant to the current decoding step (a brief sketch follows this entry).
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
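As a rough illustration of the recycling idea summarized above, the following is a minimal single-head decoding step, assuming a full key/value cache `K`, `V` and a small Python dict used as the recycling cache. The schedule (`stride`), the subset size (`top_k`), and the function signature are illustrative assumptions rather than the authors' implementation.

```python
import torch

def recycled_attention_step(q, K, V, cache, step, stride=4, top_k=256):
    """Alternate full attention with attention over a recycled top-K subset.

    q: (d,) current query;  K, V: (n, d) full key/value caches;
    cache: dict holding the recycled token indices between steps.
    """
    d = q.shape[-1]
    if step % stride == 0 or "idx" not in cache:
        # Full-attention step: attend to everything and remember which
        # tokens received the most attention.
        attn = torch.softmax((K @ q) / d**0.5, dim=-1)
        cache["idx"] = attn.topk(min(top_k, attn.numel())).indices
        return attn @ V
    # Partial-attention step: reuse (recycle) the indices selected at the
    # last full-attention step and attend only to those tokens.
    idx = cache["idx"]
    attn = torch.softmax((K[idx] @ q) / d**0.5, dim=-1)
    return attn @ V[idx]
```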
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention head, which we name Positional Heads, in various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP [32.19010113355365]
We argue that conflating different tasks by their context length is unproductive.
We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts.
We conclude that the most difficult and interesting settings, in which the necessary information is very long and highly diffused within the input, are severely under-explored.
arXiv Detail & Related papers (2024-06-29T11:09:47Z)
- Retrieval Head Mechanistically Explains Long-Context Factuality [56.78951509492645]
We show that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
arXiv Detail & Related papers (2024-04-24T00:24:03Z)
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor [49.1661870007655]
LongHeads is a training-free framework that enhances large language models' long context ability.
Instead of allowing each head to attend to the full sequence, we allow each head to process an in-distribution length of context by selecting and attending to context chunks.
LongHeads achieves 100% accuracy on the passkey retrieval task at 128k context length.
arXiv Detail & Related papers (2024-02-16T13:39:34Z)
- Attention Sorting Combats Recency Bias In Long Context Language Models [69.06809365227504]
Current language models often fail to incorporate long contexts efficiently during generation.
We show that a major contributor to this issue is attention priors that are likely learned during pre-training.
We leverage this fact to introduce "attention sorting": perform one step of decoding, sort the documents by the attention they receive, repeat the process, and generate the answer with the newly sorted context (a brief sketch follows this entry).
arXiv Detail & Related papers (2023-09-28T05:19:06Z)
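To make the attention-sorting procedure above concrete, here is a minimal sketch. The `model.decode_one_step` and `model.generate` calls are hypothetical stand-ins for whatever API exposes per-document attention, and placing the most-attended documents last (closest to the question) is an assumption reflecting the recency-bias motivation.

```python
def attention_sorted_generation(model, question, documents, rounds=2):
    """Illustrative attention-sorting loop (hypothetical `model` API)."""
    for _ in range(rounds):
        prompt = "\n\n".join(documents) + "\n\n" + question
        # Hypothetical call: run a single decoding step and return, for each
        # document, the total attention mass its tokens received.
        _, doc_attention = model.decode_one_step(prompt, return_doc_attention=True)
        # Reorder so the most-attended documents sit last, where recency-biased
        # models attend the most.
        order = sorted(range(len(documents)), key=lambda i: doc_attention[i])
        documents = [documents[i] for i in order]
    # Generate the final answer with the newly sorted context.
    return model.generate("\n\n".join(documents) + "\n\n" + question)
```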
This list is automatically generated from the titles and abstracts of the papers on this site.