Related papers: RefreshKV: Updating Small KV Cache During Long-form Generation

RefreshKV: Updating Small KV Cache During Long-form Generation

URL: http://arxiv.org/abs/2411.05787v2
Date: Mon, 03 Mar 2025 18:23:47 GMT
Title: RefreshKV: Updating Small KV Cache During Long-form Generation
Authors: Fangyuan Xu, Tanya Goyal, Eunsol Choi,
Abstract summary: We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation.<n>Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
Score: 54.00118604124301
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.

Related papers

Efficient Pretraining Length Scaling [21.4715211093876]
We present the Parallel Hidden Decoding Transformer (textitPHD-Transformer), a novel framework that enables efficient length scaling during pre-training. textitPHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens.
arXiv Detail & Related papers (2025-04-21T09:41:26Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection [23.20856449846164]
TokenSelect is a model-agnostic, training-free method for efficient and accurate long-context inference. A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention and up to 2.28x acceleration in end-to-end latency.
arXiv Detail & Related papers (2024-11-05T07:56:24Z)
Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z)
CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs [8.649971923487835]
We propose CritiPrefill, a criticality-based segment-wise prefilling method for long-context processing. CritiPrefill partitions the input sequence's queries and KV cache into segments and blocks, utilizing a segment-wise algorithm to estimate the query criticality. Extensive evaluations on multiple long-context datasets show up to 2.7x speedup on Llama3-8B and 3.0x speedup on Yi-9B for 128K context length on a single A100 GPU.
arXiv Detail & Related papers (2024-09-19T06:09:56Z)
Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation. In this work, we focus on developing an efficient compression technique for the KV cache. We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $ell$ sampling on values. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z)
Generic Attention-model Explainability by Weighted Relevance Accumulation [9.816810016935541]
We propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks.
arXiv Detail & Related papers (2023-08-20T12:02:30Z)
Learning Second-Order Attentive Context for Efficient Correspondence Pruning [22.100653202605965]
Correspondence pruning aims to search consistent correspondences (inliers) from a set of putative correspondences. In this paper, we propose an effective and efficient method for correspondence pruning.
arXiv Detail & Related papers (2023-03-28T06:40:11Z)
Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segments an image from a language expression. We develop an algorithm that shifts from being localization-centric to segmentation-language. Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.