Prompt Cache: Modular Attention Reuse for Low-Latency Inference
- URL: http://arxiv.org/abs/2311.04934v2
- Date: Thu, 25 Apr 2024 15:45:19 GMT
- Title: Prompt Cache: Modular Attention Reuse for Low-Latency Inference
- Authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
- Abstract summary: We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different prompts.
Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules.
We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts.
- Score: 12.610067639587461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without requiring any modification of model parameters.
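The mechanism can be pictured with a short sketch: precompute the attention (KV) states of a reusable text segment once, store them on the serving side, and prefill only the new portion of each prompt on top of the stored states. The snippet below is a minimal, prefix-only illustration built on Hugging Face transformers; the model choice, the `answer_with_cached_prefix` helper, and the use of `past_key_values` as the storage format are assumptions of this sketch, not the authors' implementation, and it does not cover Prompt Cache's schema for modules that appear at arbitrary positions.

```python
# Minimal sketch of attention-state (KV) reuse for a shared prompt prefix.
# This is a simplified, prefix-only variant of the idea; a real prompt-module
# cache must also keep positions consistent when modules are composed.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely.\n"

# 1) Precompute and store the attention states of the reusable segment once.
with torch.no_grad():
    prefix_ids = tok(SYSTEM_PROMPT, return_tensors="pt").input_ids
    cached_kv = model(prefix_ids, use_cache=True).past_key_values

def answer_with_cached_prefix(user_text: str, max_new_tokens: int = 32) -> str:
    """Greedy decoding that prefills only the user suffix on top of the cache."""
    past = copy.deepcopy(cached_kv)  # keep the stored states reusable across requests
    ids = tok(user_text, return_tensors="pt").input_ids
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(input_ids=ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            if next_id.item() == tok.eos_token_id:
                break
            generated.append(next_id)
            ids = next_id            # only the newly generated token is fed next
    return tok.decode(torch.cat(generated, dim=-1)[0]) if generated else ""

# 2) Requests that share the prefix skip its prefill entirely.
print(answer_with_cached_prefix("What is the capital of France?"))
```

In the full system, the schema further lets several such modules (system messages, templates, context documents) be cached independently and composed within one prompt while keeping their positions consistent.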
Related papers
- Auditing Prompt Caching in Language Model APIs [77.02079451561718]
We investigate the privacy leakage caused by prompt caching in large language models (LLMs).
We detect global cache sharing across users in seven API providers, including OpenAI.
We find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
arXiv Detail & Related papers (2025-02-11T18:58:04Z)
- Adaptive Semantic Prompt Caching with VectorQ [78.59891542553179]
Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache; cached responses are typically reused whenever this score exceeds a fixed threshold.
We show that such a one-size-fits-all threshold is insufficient across different prompts.
We propose VectorQ, a framework that learns embedding-specific threshold regions adapting to the complexity and uncertainty of an embedding; a minimal sketch of a per-entry-threshold cache follows this entry.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
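As a point of reference for the entry above, here is a minimal sketch of a semantic prompt cache that uses a per-entry similarity threshold instead of one global cut-off. The hash-based `embed` function, the `SemanticCache` class, and the hard-coded thresholds are placeholders for illustration; VectorQ instead learns threshold regions per embedding.

```python
# Minimal sketch of a semantic prompt cache with per-entry similarity
# thresholds. The toy embedding and fixed thresholds are placeholders,
# not the paper's method.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding (stand-in for a real embedding model)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

class SemanticCache:
    def __init__(self):
        self.entries = []  # (embedding, response, threshold) triples

    def put(self, prompt: str, response: str, threshold: float) -> None:
        # A per-entry threshold replaces the usual global cut-off; VectorQ
        # learns such regions from feedback, here they are simply given.
        self.entries.append((embed(prompt), response, threshold))

    def get(self, prompt: str):
        if not self.entries:
            return None
        q = embed(prompt)
        sims = [float(q @ e) for e, _, _ in self.entries]
        best = int(np.argmax(sims))
        _, response, threshold = self.entries[best]
        # Reuse the cached response only if similarity clears *this entry's*
        # threshold; ambiguous entries can demand a higher bar.
        return response if sims[best] >= threshold else None

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris.", threshold=0.85)
print(cache.get("What is the capital of France?"))  # hit: identical prompt
print(cache.get("What is the capital of Peru?"))    # miss under the toy embedding
```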
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys of the fixed context by semantic similarity and represent each cluster with a single centroid.
At inference time, queries are compared against these centroids to identify the important keys, and exact attention is computed over only those keys, thereby reducing bandwidth and computational costs; a rough sketch follows this entry.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
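A rough sketch of the two-stage lookup described above, with assumed shapes and a small inlined k-means: the fixed-context keys are clustered offline, the query is compared against centroids online to select a few promising clusters, and exact attention runs over only the selected keys. It illustrates the general idea rather than the paper's implementation.

```python
# Offline: cluster the keys of the fixed context; online: score centroids,
# keep the top clusters, and run exact attention over those keys only.
import numpy as np

rng = np.random.default_rng(0)
d, n_fixed, n_clusters, top_clusters = 64, 2048, 32, 4

K_fixed = rng.standard_normal((n_fixed, d))   # keys of the fixed prompt part
V_fixed = rng.standard_normal((n_fixed, d))   # matching values

def kmeans(x, k, iters=20):
    """Tiny k-means, sufficient for the sketch."""
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d2 = ((x ** 2).sum(1)[:, None] - 2 * x @ centroids.T
              + (centroids ** 2).sum(1)[None, :])
        labels = d2.argmin(axis=1)
        centroids = np.stack([x[labels == c].mean(0) if np.any(labels == c)
                              else centroids[c] for c in range(k)])
    return centroids, labels

centroids, labels = kmeans(K_fixed, n_clusters)   # done once, offline

def squeezed_attention(q):
    """Exact attention over only the keys whose clusters score highest for q."""
    cluster_scores = centroids @ q                      # cheap: k dot products
    keep = np.argsort(cluster_scores)[-top_clusters:]   # most promising clusters
    mask = np.isin(labels, keep)
    K_sel, V_sel = K_fixed[mask], V_fixed[mask]
    logits = K_sel @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ V_sel, int(mask.sum())

q = rng.standard_normal(d)
out, n_used = squeezed_attention(q)
print(f"attended to {n_used}/{n_fixed} fixed-context keys; output dim {out.shape[0]}")
```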
- BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [2.3587921104010756]
We propose BUZZ, a novel KV-caching algorithm that minimizes cache memory usage while improving inference speed.
BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and segmented heavy hitters to retain important older tokens; a generic sketch of the window-plus-heavy-hitter idea follows this entry.
We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA.
arXiv Detail & Related papers (2024-10-30T14:53:37Z)
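The sketch below illustrates only the generic sliding-window-plus-heavy-hitter eviction idea referenced in the entry above: always keep the most recent tokens, plus a small budget of older tokens ranked by accumulated attention mass. BUZZ's beehive structure and segmented heavy-hitter selection are more elaborate; the shapes, budgets, and scoring rule here are assumptions for illustration.

```python
# Generic sliding-window + heavy-hitter KV eviction sketch (not BUZZ itself).
import numpy as np

rng = np.random.default_rng(0)
d, window, heavy_budget, n_steps = 32, 8, 4, 64

keys, values, scores, positions = [], [], [], []   # the (pruned) KV cache

for t in range(n_steps):
    q = rng.standard_normal(d)
    k_new, v_new = rng.standard_normal(d), rng.standard_normal(d)
    keys.append(k_new); values.append(v_new); scores.append(0.0); positions.append(t)

    # Attention over whatever is currently cached.
    logits = np.stack(keys) @ q / np.sqrt(d)
    w = np.exp(logits - logits.max()); w /= w.sum()
    _ = w @ np.stack(values)                       # attention output (unused here)

    # Accumulate attention mass per cached token ("heavy hitter" statistic).
    for i in range(len(scores)):
        scores[i] += float(w[i])

    # Evict: always keep the most recent `window` tokens, plus the
    # `heavy_budget` highest-scoring older tokens.
    recent = [i for i, p in enumerate(positions) if p > t - window]
    older = [i for i, p in enumerate(positions) if p <= t - window]
    older.sort(key=lambda i: scores[i], reverse=True)
    keep = sorted(set(recent + older[:heavy_budget]))
    keys = [keys[i] for i in keep]; values = [values[i] for i in keep]
    scores = [scores[i] for i in keep]; positions = [positions[i] for i in keep]

print(f"cache holds {len(keys)} of {n_steps} tokens "
      f"(window={window}, heavy_budget={heavy_budget})")
```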
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [19.510078997414606]
EPIC introduces position-independent context caching for large language models.
EPIC delivers up to 8x improvements in time-to-first-token (TTFT) and 7x higher throughput over existing systems.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of cached states.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill-stage latency by a factor of 6.8 compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens; a rough sketch of the underlying decomposition follows this entry.
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
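A rough sketch of the idea behind the entry above, under assumed shapes: attention over the single shared system-prompt KV is computed with batched matmuls (so those states are streamed once for the whole batch), attention over each request's own context is computed separately, and the two normalized partial results are merged exactly with a log-sum-exp correction. This illustrates the decomposition, not the paper's kernel.

```python
# Split attention into a shared system-prompt part and a per-request part,
# then merge the two partial softmax results exactly. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
B, d, n_sys, n_req = 4, 64, 512, 32             # batch, head dim, context sizes

K_sys = rng.standard_normal((n_sys, d))          # shared system-prompt keys
V_sys = rng.standard_normal((n_sys, d))
K_req = rng.standard_normal((B, n_req, d))       # per-request keys / values
V_req = rng.standard_normal((B, n_req, d))
Q = rng.standard_normal((B, d))                  # one query vector per request

def partial_attention(q, K, V):
    """Attention over a subset of keys; also return its log-sum-exp weight."""
    logits = K @ q / np.sqrt(d)
    m = logits.max()
    w = np.exp(logits - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())

# System-prompt part: one batched pass over the single shared K_sys / V_sys.
logits_sys = Q @ K_sys.T / np.sqrt(d)                          # (B, n_sys)
m_sys = logits_sys.max(axis=1, keepdims=True)
w_sys = np.exp(logits_sys - m_sys)
sys_out = (w_sys @ V_sys) / w_sys.sum(axis=1, keepdims=True)   # (B, d)
sys_lse = (m_sys + np.log(w_sys.sum(axis=1, keepdims=True)))[:, 0]

out = np.empty((B, d))
for b in range(B):
    req_out, req_lse = partial_attention(Q[b], K_req[b], V_req[b])
    # Merge the two normalized partial results exactly (online-softmax style).
    a = np.exp(sys_lse[b] - np.logaddexp(sys_lse[b], req_lse))
    out[b] = a * sys_out[b] + (1.0 - a) * req_out

# Check against exact attention over the concatenated context.
ref = np.stack([partial_attention(Q[b],
                                  np.concatenate([K_sys, K_req[b]]),
                                  np.concatenate([V_sys, V_req[b]]))[0]
                for b in range(B)])
print("max deviation from full attention:", float(np.abs(out - ref).max()))
```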
This list is automatically generated from the titles and abstracts of the papers on this site.