Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression
- URL: http://arxiv.org/abs/2502.14477v1
- Date: Thu, 20 Feb 2025 11:52:36 GMT
- Title: Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression
- Authors: Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang,
- Abstract summary: Efficient Selective Attention (ESA) is a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention.
We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k.
- Score: 40.37188244460367
- License:
- Abstract: Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.
Related papers
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose textbfActQKV, a training-free, textbfActivation-aware approach that dynamically determines probe-textbfQuery and leverages it to retrieve the relevant textbfKV pairs for inference.
Experiments on the Long-Bench and $infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z) - Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation.
We propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them.
arXiv Detail & Related papers (2024-12-03T08:29:27Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - Recycled Attention: Efficient inference for long-context language models [54.00118604124301]
We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
arXiv Detail & Related papers (2024-11-08T18:57:07Z) - TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [23.20856449846164]
TokenSelect is a model-agnostic, training-free method for efficient and accurate long-context inference.
A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention and up to 2.28x acceleration in end-to-end latency.
arXiv Detail & Related papers (2024-11-05T07:56:24Z) - Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension [21.729875191721984]
We introduce correlation-aware selection and merging mechanisms to facilitate efficient sparse attention.
We also propose a novel data augmentation technique involving positional encodings to enhance generalization to unseen positions.
Our method achieves 100% accuracy on the passkey task with a context length of 4M and maintains stable perplexity at a 1M context length.
arXiv Detail & Related papers (2024-10-05T15:59:32Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.