Related papers: LongHeads: Multi-Head Attention is Secretly a Long Context Processor

LongHeads: Multi-Head Attention is Secretly a Long Context Processor

URL: http://arxiv.org/abs/2402.10685v2
Date: Mon, 25 Mar 2024 11:50:32 GMT
Title: LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Authors: Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang,
Abstract summary: LongHeads is a training-free framework that enhances large language models' long context ability. Instead of allowing each head to attend to the full sentence, we allow each head to process in-distribution length by selecting and attending to context chunks. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task.
Score: 49.1661870007655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention's quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM's long context ability by unlocking multi-head attention's untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently in linear time, fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task, verifying LongHeads's efficacy in extending the usable context window for existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .

Related papers

SELF: Self-Extend the Context Length With Logistic Growth Function [24.523942828913405]
We propose SELF: a solution of grouping tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances.<n>Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval.<n>On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM.
arXiv Detail & Related papers (2025-05-22T21:23:20Z)
LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models. Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges. We enhance the performance of speculative decoding in long-context settings by addressing these challenges.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs) Strikingly, many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z)
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resource over 85% compared to full sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
FocusLLM: Scaling LLM's Context by Parallel Decoding [16.642675785000176]
FocusLLM is a framework designed to extend the context length of any decoder-only LLM. FocusLLM processes long text inputs by dividing them into chunks based on the model's original context length. It appends the local context to each chunk as a prompt to extract essential information from each chunk based on a novel parallel decoding mechanism.
arXiv Detail & Related papers (2024-08-21T16:11:59Z)
InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM. InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention. Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe of the instruction data, training, and evaluation for long context alignment. We construct a long instruction-following dataset using Self-Instruct. We adopt the packing and sorted strategies to speed up supervised fine-tuning on data with varied length distributions.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding. It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese)
arXiv Detail & Related papers (2023-08-28T11:53:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.