Related papers: MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services

MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services

URL: http://arxiv.org/abs/2403.02694v3
Date: Mon, 15 Jul 2024 22:33:58 GMT
Title: MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services
Authors: Waris Gill, Mohamed Elidrisi, Pallavi Kalapatapu, Ammar Ahmed, Ali Anwar, Muhammad Ali Gulzar,
Abstract summary: Caching is a natural solution to reduce inference costs on repeated queries. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services. MeanCache identifies semantically similar queries to determine cache hit or miss.
Score: 8.350378532274405
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion parameters, where inference demands billions of floating-point operations. Caching is a natural solution to reduce LLM inference costs on repeated queries, which constitute about 31% of the total queries. However, existing caching methods are incapable of finding semantic similarities among LLM queries nor do they operate on contextual queries, leading to unacceptable false hit-and-miss rates. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services that identifies semantically similar queries to determine cache hit or miss. Using MeanCache, the response to a user's semantically similar query can be retrieved from a local cache rather than re-querying the LLM, thus reducing costs, service provider load, and environmental impact. MeanCache leverages Federated Learning (FL) to collaboratively train a query similarity model without violating user privacy. By placing a local cache in each user's device and using FL, MeanCache reduces the latency and costs and enhances model performance, resulting in lower false hit rates. MeanCache also encodes context chains for every cached query, offering a simple yet highly effective mechanism to discern contextual query responses from standalone. Our experiments benchmarked against the state-of-the-art caching method, reveal that MeanCache attains an approximately 17% higher F-score and a 20% increase in precision during semantic cache hit-and-miss decisions while performing even better on contextual queries. It also reduces the storage requirement by 83% and accelerates semantic cache hit-and-miss decisions by 11%.

Related papers

A Generative Caching System for Large Language Models [1.2132389187658934]
Caching has the potential to be of significant benefit for accessing large language models (LLMs) This paper presents a new caching system for improving user experiences with LLMs. A key feature we provide is generative caching, wherein multiple cached responses can be synthesized to provide answers to queries which have never been seen before.
arXiv Detail & Related papers (2025-03-22T01:17:56Z)
Auditing Prompt Caching in Language Model APIs [77.02079451561718]
We investigate the privacy leakage caused by prompt caching in large language models (LLMs) We detect global cache sharing across users in seven API providers, including OpenAI. We find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
arXiv Detail & Related papers (2025-02-11T18:58:04Z)
Adaptive Semantic Prompt Caching with VectorQ [78.59891542553179]
Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache. Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit. We show that this one-size-fits-all threshold is insufficient across different embeddings. We propose VectorQ, an online framework with a threshold convergence guarantee to learn embedding-specific threshold regions.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
InstCache: A Predictive Cache for LLM Serving [9.878166964839512]
We propose to predict user-instructions by an instruction-aligned LLM and store them in a predictive cache, so-called InstCache. Experimental results show that InstCache can achieve up to 51.34% hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost of only 4.5GB.
arXiv Detail & Related papers (2024-11-21T03:52:41Z)
GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching [0.0]
GPT Semantic Cache is a method that leverages semantic caching of query embeddings in in-memory storage (Redis) Our approach efficiently identifies semantically similar questions, allowing for the retrieval of pre-generated responses without redundant API calls to the Large Language Models. This technique reduces operational costs and improves response times, enhancing the efficiency of LLM-powered applications.
arXiv Detail & Related papers (2024-11-08T02:21:19Z)
Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models. We propose an importance-driven cache merging strategy to prune redundancy caches. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters. We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns. Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLMChat services.
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Models (LLMs) inference for analytical workloads that invoke LLMs within relational queries. We implement these optimizations in Apache Spark, with vLLM as the model serving backend. We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
A Learning-Based Caching Mechanism for Edge Content Delivery [2.412158290827225]
5G networks and the rise of the Internet of Things (IoT) are increasingly extending into the network edge. This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge. We introduce HR-Cache, a learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering.
arXiv Detail & Related papers (2024-02-05T08:06:03Z)
LLMs for Test Input Generation for Semantic Caches [1.8628177380024746]
Large language models (LLMs) enable state-of-the-art semantic capabilities to be added to software systems. At scale, the cost of serving thousands of users increases massively affecting also user experience. We present VaryGen, an approach for using LLMs for test input generation that produces similar questions from unstructured text documents.
arXiv Detail & Related papers (2024-01-16T06:16:33Z)
DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture. DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models. Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching. While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error. We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.