EPIC: Efficient Position-Independent Caching for Serving Large Language Models
- URL: http://arxiv.org/abs/2410.15332v3
- Date: Tue, 27 May 2025 09:24:50 GMT
- Title: EPIC: Efficient Position-Independent Caching for Serving Large Language Models
- Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
- Abstract summary: Caching improves serving performance by reusing Key-Value vectors across requests. Existing context caching requires exact prefixes across requests. We introduce Position-Independent Caching (PIC), which enables modular reuse of KV vectors regardless of prefixes. We also introduce EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning.
- Score: 19.510078997414606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.
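The abstract describes two ingredients: caching each immutable chunk's KV vectors independently of whatever prefix precedes it, and repairing the spurious "attention sink" that otherwise appears at every document beginning. The minimal Python sketch below only illustrates that flow; the content-hash cache key, the `recompute` hook, and the choice to re-encode the first `k` tokens of each chunk are illustrative assumptions, not EPIC's or LegoLink's actual design or interfaces.

```python
from dataclasses import dataclass
from hashlib import sha256

@dataclass
class ChunkKV:
    tokens: tuple   # token ids of one immutable chunk (e.g., a retrieved document)
    kv: list        # per-token KV entries, computed independently of any prefix

def chunk_key(tokens):
    # Content hash: the cache key ignores where the chunk sits in the prompt.
    return sha256(str(tokens).encode("utf-8")).hexdigest()

class PICCache:
    def __init__(self, compute_kv):
        self.store = {}                 # chunk hash -> ChunkKV
        self.compute_kv = compute_kv    # callable: tokens -> list of KV entries

    def get_or_build(self, tokens):
        key = chunk_key(tokens)
        if key not in self.store:
            # KV vectors are built once and reused no matter which prefix
            # precedes this chunk in later requests.
            self.store[key] = ChunkKV(tuple(tokens), self.compute_kv(tokens))
        return self.store[key]

def assemble_prompt_kv(cache, chunks, recompute, k=2):
    """Concatenate cached per-chunk KVs and recompute only the first k tokens
    of each chunk in its actual position, a stand-in for how a LegoLink-style
    step could counter the spurious attention sink at each document start."""
    kv, offset = [], 0
    for tokens in chunks:
        entry = cache.get_or_build(tokens)
        patched = recompute(tokens[:k], offset) + entry.kv[k:]
        kv.extend(patched)
        offset += len(tokens)
    return kv

# Toy stand-ins so the sketch runs end to end; a real server would call the model.
fake_kv = lambda tokens: [("kv", t) for t in tokens]
fake_recompute = lambda tokens, offset: [("kv*", t, offset) for t in tokens]

cache = PICCache(fake_kv)
docs = [(101, 102, 103), (201, 202, 203, 204)]
print(len(assemble_prompt_kv(cache, docs, fake_recompute)))  # 7 KV entries from 2 cached chunks
```

In a real serving system the recomputation would run inside the prefill kernel rather than as a separate pass; the sketch only marks where cached and freshly computed KV entries would meet.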
Related papers
- Single-pass Adaptive Image Tokenization for Minimum Program Search [75.59409288259151]
We propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass. KARL matches the performance of recent adaptive tokenizers while operating in a single pass.
arXiv Detail & Related papers (2025-07-10T17:59:53Z) - EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse [22.769631685777494]
Cross-request key-value (KV) cache reuse is a technique that stores and reuses intermediate computations. In infilling tasks, KV cache reuse is often hindered by the structure of the prompt format. We propose EFIM, a transformed prompt format for FIM that unleashes the performance potential of KV cache reuse.
arXiv Detail & Related papers (2025-05-28T02:07:03Z) - Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation [80.69067017594709]
Large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks. We propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time. Our method significantly outperforms standard agentic systems that do not utilize logs.
arXiv Detail & Related papers (2025-05-20T14:14:38Z) - MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving [32.56855948056532]
This paper proposes position-independent caching as a more effective approach for multimodal information management.
We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges.
arXiv Detail & Related papers (2025-02-04T03:13:09Z) - EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation [8.757777529568383]
Current RAG systems often struggle when retrieval models fail to rank the most relevant documents. We introduce EXIT, an extractive context compression framework. Our evaluations show that EXIT consistently surpasses existing compression methods.
arXiv Detail & Related papers (2024-12-17T05:38:27Z) - CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR).
CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference.
Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [68.71450519846081]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching [2.392066774757727]
Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks. These tasks often exhibit prefix sharing, where different prompt inputs partially share a common prefix. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes across requests. We propose BatchLLM to address the above problems.
arXiv Detail & Related papers (2024-11-29T05:57:37Z) - Compute Or Load KV Cache? Why Not Both? [6.982874528357836]
Cake is a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel.
Cake achieves on average 2.6x reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods.
arXiv Detail & Related papers (2024-10-04T01:11:09Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference [20.249206904309816]
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information.
This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt.
We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
arXiv Detail & Related papers (2024-04-23T18:10:42Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition [3.659659889927316]
ChunkAttention is a prefix-aware self-attention module for large language models.
It can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime.
Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8x compared to the state-of-the-art implementation.
arXiv Detail & Related papers (2024-02-23T09:29:19Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they also introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.