Stateful Large Language Model Serving with Pensieve
- URL: http://arxiv.org/abs/2312.05516v2
- Date: Tue, 28 May 2024 04:34:37 GMT
- Title: Stateful Large Language Model Serving with Pensieve
- Authors: Lingfan Yu, Jinyang Li
- Abstract summary: Pensieve is a system optimized for multi-turn conversation LLM serving.
Pensieve maintains the conversation state across requests by caching previously processed history to avoid duplicate processing.
- Score: 2.942499535977391
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are wildly popular today and it is important to serve them efficiently. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated processing. In this paper, we design Pensieve, a system optimized for multi-turn conversation LLM serving. Pensieve maintains the conversation state across requests by caching previously processed history to avoid duplicate processing. Pensieve's multi-tier caching strategy can utilize both GPU and CPU memory to efficiently store and retrieve cached data. Pensieve also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that Pensieve can achieve 13-58% more throughput compared to vLLM and TensorRT-LLM and significantly reduce latency.
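The abstract's core idea, keeping per-conversation state across requests and spilling it from GPU to CPU memory rather than discarding it, can be illustrated with a toy model. The sketch below is not Pensieve's implementation: the class `TwoTierConversationCache`, its method names, the token-count capacity, and the least-recently-used spill policy are all assumptions made for illustration, and plain Python lists stand in for cached attention state.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple


class TwoTierConversationCache:
    """Toy stand-in for a stateful serving cache (hypothetical, not Pensieve's code).

    Each conversation id maps to the token ids whose state is already cached.
    A small "GPU" tier spills least-recently-used conversations to a "CPU"
    tier instead of dropping them (the LRU policy is an assumption).
    """

    def __init__(self, gpu_capacity_tokens: int) -> None:
        self.gpu_capacity_tokens = gpu_capacity_tokens
        self.gpu: "OrderedDict[str, List[int]]" = OrderedDict()
        self.cpu: Dict[str, List[int]] = {}

    def _gpu_tokens(self) -> int:
        return sum(len(tokens) for tokens in self.gpu.values())

    def _evict_to_cpu(self, keep: str) -> None:
        # Spill least-recently-used conversations until the GPU tier fits,
        # never spilling the conversation that is being served right now.
        while self._gpu_tokens() > self.gpu_capacity_tokens and len(self.gpu) > 1:
            victim = next(iter(self.gpu))
            if victim == keep:
                self.gpu.move_to_end(victim)
                continue
            self.cpu[victim] = self.gpu.pop(victim)

    def process_turn(self, conv_id: str, new_tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_processed) for this turn."""
        if conv_id in self.cpu:
            # Swap the history back into the GPU tier rather than recomputing it.
            self.gpu[conv_id] = self.cpu.pop(conv_id)
        history = self.gpu.setdefault(conv_id, [])
        reused = len(history)
        history.extend(new_tokens)  # only the new tokens need processing
        self.gpu.move_to_end(conv_id)
        self._evict_to_cpu(keep=conv_id)
        return reused, len(new_tokens)


if __name__ == "__main__":
    cache = TwoTierConversationCache(gpu_capacity_tokens=8)
    print(cache.process_turn("a", [1, 2, 3]))  # first turn: nothing to reuse
    print(cache.process_turn("a", [4, 5]))     # second turn reuses the first
    print(cache.process_turn("b", [9] * 6))    # pushes "a" to the CPU tier
    print(cache.process_turn("a", [6]))        # "a" is swapped back in
```

In this toy model a stateless server would reprocess the whole history of conversation "a" on every turn, whereas the cached version processes only the tokens added since the previous turn, even after the conversation has been spilled to the CPU tier and swapped back.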
Related papers
- A Generative Caching System for Large Language Models [1.2132389187658934]
Caching has the potential to be of significant benefit for accessing large language models (LLMs).
This paper presents a new caching system for improving user experiences with LLMs.
A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries which have never been seen before.
arXiv Detail & Related papers (2025-03-22T01:17:56Z) - GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis).
Our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models.
This technique reduces operational costs and improves response times, enhancing the efficiency of LLM-powered applications.
arXiv Detail & Related papers (2024-11-08T02:21:19Z) - BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [2.3587921104010756]
We propose BUZZ, a novel KV caching algorithm to minimize cache memory usage while enhancing inference speed.
BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information.
We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA.
arXiv Detail & Related papers (2024-10-30T14:53:37Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Model (LLM) inference for analytical workloads that invoke LLMs within relational queries.
We implement these optimizations in Apache Spark, with vLLM as the model serving backend.
We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z) - Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization [7.674972936853123]
We investigate whether combining the queries for the same input context in a single prompt to minimize repeated calls can be successfully used in meeting summarization.
We observe that 100% reliability in generating the response in the expected format is usually limited to certain closed-source LLMs.
arXiv Detail & Related papers (2024-02-29T19:00:47Z) - RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - Efficient Memory Management for Large Language Model Serving with PagedAttention [44.70922552274376]
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time.
Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
We propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.
arXiv Detail & Related papers (2023-09-12T12:50:04Z) - Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to recursively generate summaries/memory using large language models (LLMs) to enhance long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z) - H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
arXiv Detail & Related papers (2023-06-24T20:11:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.