IC-Cache: Efficient Large Language Model Serving via In-context Caching
- URL: http://arxiv.org/abs/2501.12689v3
- Date: Thu, 04 Sep 2025 06:20:55 GMT
- Title: IC-Cache: Efficient Large Language Model Serving via In-context Caching
- Authors: Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler,
- Abstract summary: IC-Cache is a caching system that enables live capability augmentation to improve serving efficiency.<n>We show that IC-Cache improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality.
- Score: 16.75800945078601
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge transfer among requests. However, naively caching and reusing past responses leads to a big quality drop. In this paper, we introduce IC-Cache, a caching system that enables live LLM capability augmentation to improve serving efficiency: by leveraging historical request-response pairs from larger models as in-context examples, IC-Cache empowers small LLMs to imitate and even exceed the compositional abilities (e.g., reasoning) of their larger counterparts, enabling selective offloading of requests to reduce cost and latency. Achieving this live augmentation at scale introduces intricate trade-offs between response quality, latency, and system throughput. For a new request, IC-Cache efficiently selects similar, high-utility examples to prepend them to the new request's input. At scale, it adaptively routes requests across LLMs of varying capabilities, accounting for response quality and serving loads. IC-Cache employs a cost-aware cache replay mechanism that refines example quality offline to maximize online cache utility and efficiency. Evaluations on millions of realistic requests demonstrate that IC-Cache improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without hurting response quality.
Related papers
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching [75.02865981328509]
Caching reduces computation by reusing previously computed model outputs across timesteps.<n>We propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis.<n>SenCache achieves better visual quality than existing caching methods under similar computational budgets.
arXiv Detail & Related papers (2026-02-27T17:36:09Z) - Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Models.<n>Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation.<n>We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z) - Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts [18.479200918676575]
Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services.<n>However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and privacy concerns.<n>This paper proposes a novel deep reinforcement learning-based routing framework for sustained high-quality LLM services.
arXiv Detail & Related papers (2025-08-01T00:45:15Z) - TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses [1.7079407109348677]
Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency.<n>We present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts.
arXiv Detail & Related papers (2025-07-31T15:50:57Z) - LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models.<n>LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out-of-memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z) - HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving [5.698111842478072]
Early-Exit LLMs efficiently navigate this trade-off space by skipping some of the later model layers.
Current frameworks statically select a model for a user task, limiting our ability to adapt to changing nature of the input queries.
We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs, evaluates them using a subset of prompts, gathering telemetry data in real-time.
Second, HELIOS uses the early exit data from these evaluations to greedily load the selected model only up to a limited number of layers.
arXiv Detail & Related papers (2025-04-14T21:30:43Z) - A Generative Caching System for Large Language Models [1.2132389187658934]
Caching has the potential to be of significant benefit for accessing large language models (LLMs)<n>This paper presents a new caching system for improving user experiences with LLMs.<n>A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries which have never been seen before.
arXiv Detail & Related papers (2025-03-22T01:17:56Z) - QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation.<n>DiTs come with significant drawbacks, including increased computational and memory costs.<n>We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z) - Smart Routing: Cost-Effective Multi-LLM Serving for Multi-Core AIOS [31.60019342381251]
Existing scheduling frameworks mainly target at latency optimization.
This paper proposes an efficient capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving.
arXiv Detail & Related papers (2025-02-27T22:35:31Z) - MixLLM: Dynamic Routing in Mixed Large Language Models [57.309520357563215]
Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency.
We develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment.
arXiv Detail & Related papers (2025-02-09T02:26:15Z) - Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices.
One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM.
We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings.
arXiv Detail & Related papers (2025-02-06T18:59:11Z) - InstCache: A Predictive Cache for LLM Serving [6.076957323090607]
Caching techniques offer opportunities to optimize the performance of Large Language Models inference engines.<n>High variability in the content and length of instructions make it rare for identical instructions to recur within a short time window.<n>We propose InstCache, a predictive caching mechanism for LLM serving systems.
arXiv Detail & Related papers (2024-11-21T03:52:41Z) - Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment [56.87031484108484]
Large Language Models (LLMs) are increasingly recognized for their practical applications.
Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs.
By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs.
arXiv Detail & Related papers (2024-11-09T15:12:28Z) - GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis)<n>By storing user queries, our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models.<n>Our experiments demonstrate that GPT Semantic Cache reduces API calls by up to 68.8% across various query categories, with cache hit rates ranging from 61.6% to 68.8%.
arXiv Detail & Related papers (2024-11-08T02:21:19Z) - AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z) - A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems [49.588316022381385]
We propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), with Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens.
DARE has been deployed to online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining the downstream performance.
arXiv Detail & Related papers (2024-08-11T02:31:13Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Llumnix: Dynamic Scheduling for Large Language Model Serving [17.919408899409113]
Inference serving for large language models (LLMs) is the key to unleashing their potential.
We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling.
We show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings.
arXiv Detail & Related papers (2024-06-05T13:20:18Z) - SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z) - Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.