Recycled Attention: Efficient inference for long-context language models
- URL: http://arxiv.org/abs/2411.05787v1
- Date: Fri, 08 Nov 2024 18:57:07 GMT
- Title: Recycled Attention: Efficient inference for long-context language models
- Authors: Fangyuan Xu, Tanya Goyal, Eunsol Choi,
- Abstract summary: We propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens.
When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens.
Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step.
- Score: 54.00118604124301
- License:
- Abstract: Generating long sequences of tokens given a long-context input imposes a heavy computational burden for large language models (LLMs). One of the computational bottleneck comes from computing attention over a long sequence of input at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our methods on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and long-context language modeling tasks. Applying our method to off-the-shelf LLMs achieves comparable speedup to baselines which only consider local context while improving the performance by 2x. We further explore two ideas to improve performance-efficiency trade-offs: (1) dynamically decide when to perform recycled or full attention step based on the query similarities and (2) continued pre-training the model with Recycled Attention.
Related papers
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose textbfActQKV, a training-free, textbfActivation-aware approach that dynamically determines probe-textbfQuery and leverages it to retrieve the relevant textbfKV pairs for inference.
Experiments on the Long-Bench and $infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z) - Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism.
It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget.
We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup.
arXiv Detail & Related papers (2025-02-17T08:39:43Z) - AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, which is the first learning-based critical token identification approach.
AttentionPredictor accurately predicts the attention score while consuming negligible memory.
We also propose a cross-token critical cache prefetching framework that hides the token time overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z) - Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling.
Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [23.20856449846164]
TokenSelect is a model-agnostic, training-free method for efficient and accurate long-context inference.
A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention and up to 2.28x acceleration in end-to-end latency.
arXiv Detail & Related papers (2024-11-05T07:56:24Z) - CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs [8.649971923487835]
We propose CritiPrefill, a criticality-based segment-wise prefilling method for long-context processing.
CritiPrefill partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality.
Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU.
arXiv Detail & Related papers (2024-09-19T06:09:56Z) - Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.