LeadCache: Regret-Optimal Caching in Networks
- URL: http://arxiv.org/abs/2009.08228v4
- Date: Tue, 26 Oct 2021 13:59:48 GMT
- Title: LeadCache: Regret-Optimal Caching in Networks
- Authors: Debjit Paria, Abhishek Sinha
- Abstract summary: We propose an efficient online caching policy based on the Follow-the-Perturbed-Leader paradigm.
We show that $\texttt{LeadCache}$ is regret-optimal up to a factor of $\tilde{O}(n^{3/8})$, where $n$ is the number of users.
- Score: 8.208569626646034
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We consider an online prediction problem in the context of network caching.
Assume that multiple users are connected to several caches via a bipartite
network. At any time slot, each user may request an arbitrary file chosen from
a large catalog. A user's request at a slot is met if the requested file is
cached in at least one of the caches connected to the user. Our objective is to
predict, prefetch, and optimally distribute the files on the caches at each
slot to maximize the total number of cache hits. The problem is non-trivial due
to the non-convex and non-smooth nature of the objective function. In this
paper, we propose $\texttt{LeadCache}$ - an efficient online caching policy
based on the Follow-the-Perturbed-Leader paradigm. We show that
$\texttt{LeadCache}$ is regret-optimal up to a factor of $\tilde{O}(n^{3/8}),$
where $n$ is the number of users. We design two efficient implementations of
the $\texttt{LeadCache}$ policy, one based on Pipage rounding and the other
based on Madow's sampling, each of which makes precisely one call to an
LP-solver per iteration. Furthermore, with a Strong-Law-type assumption, we
show that the total number of file fetches under $\texttt{LeadCache}$ remains
almost surely finite over an infinite horizon. Finally, we derive an
approximately tight regret lower bound using results from graph coloring. We
conclude that the learning-based $\texttt{LeadCache}$ policy decisively
outperforms the state-of-the-art caching policies both theoretically and
empirically.
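For intuition, the following is a minimal sketch of a Follow-the-Perturbed-Leader caching rule in the simplified single-cache setting. The catalog size, cache capacity, perturbation scale, and synthetic Zipf requests are illustrative assumptions; the actual $\texttt{LeadCache}$ policy operates on the bipartite user-cache network and obtains integral cache configurations by solving an LP followed by Pipage rounding or Madow's sampling.

```python
import numpy as np

# Simplified single-cache Follow-the-Perturbed-Leader (FTPL) sketch.
# NOTE: illustrative only; LeadCache itself works on a bipartite user-cache
# network and rounds an LP solution (Pipage rounding / Madow's sampling).

N_FILES = 1000          # catalog size (assumed)
CACHE_SIZE = 50         # cache capacity C (assumed)
ETA = 10.0              # perturbation scale (assumed; tuned in the paper)

rng = np.random.default_rng(0)
counts = np.zeros(N_FILES)               # cumulative request counts
gamma = rng.normal(size=N_FILES)         # Gaussian perturbation, sampled once

def ftpl_cache():
    """Cache the top-C files of the perturbed cumulative request counts."""
    score = counts + ETA * gamma
    return set(np.argpartition(score, -CACHE_SIZE)[-CACHE_SIZE:])

hits = 0
for _ in range(10_000):
    cache = ftpl_cache()                 # prefetch before seeing the request
    req = rng.zipf(1.2) % N_FILES        # synthetic Zipf-like request
    hits += req in cache
    counts[req] += 1                     # observe the request

print(f"hit rate: {hits / 10_000:.3f}")
```

The key point of the paradigm is that the perturbation is drawn once and the cache configuration is recomputed from the perturbed cumulative counts at every slot.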
Related papers
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
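As a rough illustration of importance-driven merging (not Elastic Cache's actual rule for LVLMs), the sketch below keeps the highest-importance key/value pairs as anchors and folds each pruned pair into its most similar anchor; the importance scores and similarity measure are placeholders.

```python
import numpy as np

# Hedged sketch of importance-driven KV-cache merging: keep the most
# important key/value pairs as anchors and fold each pruned pair into its
# nearest anchor by a weighted average.  "Importance" is a stand-in here.

def merge_kv_cache(keys, values, importance, budget):
    """keys, values: (T, d) arrays; importance: (T,); budget: entries to keep."""
    order = np.argsort(importance)[::-1]
    keep, drop = order[:budget], order[budget:]
    merged_k, merged_v = keys[keep].copy(), values[keep].copy()
    weight = importance[keep].copy()
    for i in drop:
        j = np.argmax(merged_k @ keys[i])        # most similar kept key
        w = importance[i]
        merged_k[j] = (weight[j] * merged_k[j] + w * keys[i]) / (weight[j] + w)
        merged_v[j] = (weight[j] * merged_v[j] + w * values[i]) / (weight[j] + w)
        weight[j] += w
    return merged_k, merged_v

rng = np.random.default_rng(0)
k, v = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
imp = rng.random(128)
mk, mv = merge_kv_cache(k, v, imp, budget=32)
print(mk.shape, mv.shape)   # (32, 64) (32, 64)
```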
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - MeanCache: User-Centric Semantic Cache for Large Language Model Based Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries.
This paper introduces MeanCache, a user-centric semantic cache for LLM-based services.
MeanCache identifies semantically similar queries to determine cache hit or miss.
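A minimal sketch of the semantic-cache idea: a query is a hit if its embedding is sufficiently similar to a cached query's embedding. The `embed` function and the similarity threshold below are toy placeholders, and MeanCache's user-centric design adds components not shown here.

```python
import numpy as np

# Minimal semantic-cache sketch: a query hits if its embedding is close
# enough (cosine similarity) to a previously cached query.  `embed` is a
# placeholder for a real sentence-embedding model.

def embed(text: str) -> np.ndarray:
    # placeholder embedding: hashed bag-of-words, L2-normalized
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if float(q @ emb) >= self.threshold:
                return response          # semantic cache hit
        return None                      # cache miss

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.5)
cache.put("what is the capital of France?", "Paris")
print(cache.get("capital city of France?"))   # semantic hit with this toy embedder
```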
arXiv Detail & Related papers (2024-03-05T06:23:50Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
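One way to picture the idea: keep recent KV pairs exactly and fold evicted pairs into a constant-size summary instead of discarding them. The plain running average below is only a stand-in for LESS's learned low-rank state.

```python
import numpy as np
from collections import deque

# Sketch of pairing an eviction-based KV cache with a constant-size summary:
# recent key/value pairs are kept exactly, evicted pairs are folded into a
# fixed-size running average instead of being discarded (placeholder for the
# learned recurrent state in LESS).

class WindowedKVCache:
    def __init__(self, window: int, dim: int):
        self.window = window
        self.recent: deque[tuple[np.ndarray, np.ndarray]] = deque()
        self.summary_k = np.zeros(dim)
        self.summary_v = np.zeros(dim)
        self.n_evicted = 0

    def append(self, k: np.ndarray, v: np.ndarray):
        self.recent.append((k, v))
        if len(self.recent) > self.window:
            old_k, old_v = self.recent.popleft()      # evict oldest pair
            self.n_evicted += 1
            self.summary_k += (old_k - self.summary_k) / self.n_evicted
            self.summary_v += (old_v - self.summary_v) / self.n_evicted

    def state(self):
        """Exact recent pairs plus one summary pair for everything evicted."""
        pairs = list(self.recent)
        if self.n_evicted:
            pairs.append((self.summary_k, self.summary_v))
        return pairs

cache = WindowedKVCache(window=4, dim=8)
rng = np.random.default_rng(0)
for _ in range(10):
    cache.append(rng.normal(size=8), rng.normal(size=8))
print(len(cache.state()))   # 4 recent pairs + 1 summary pair = 5
```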
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - A Learning-Based Caching Mechanism for Edge Content Delivery [2.412158290827225]
5G networks and the rise of the Internet of Things (IoT) are pushing content delivery increasingly toward the network edge.
This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge.
We introduce HR-Cache, a learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering.
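As a rough sketch of hazard-rate ordering (not HR-Cache's actual learned estimator or admission rule), the code below estimates each object's hazard rate as the reciprocal of its mean inter-request gap and evicts the object with the lowest estimate when the cache is full.

```python
from collections import defaultdict

# Toy hazard-rate-ordering cache: hazard rate is approximated as
# requests-per-unit-time between repeats; lowest-hazard object is evicted.

class HazardRateCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = set()
        self.last_seen = {}
        self.gap_sum = defaultdict(float)
        self.gap_cnt = defaultdict(int)
        self.clock = 0

    def _hazard(self, obj) -> float:
        if self.gap_cnt[obj] == 0:
            return 0.0                              # never re-requested yet
        return self.gap_cnt[obj] / self.gap_sum[obj]

    def request(self, obj) -> bool:
        self.clock += 1
        if obj in self.last_seen:                   # update inter-request gap
            self.gap_sum[obj] += self.clock - self.last_seen[obj]
            self.gap_cnt[obj] += 1
        self.last_seen[obj] = self.clock
        hit = obj in self.cache
        if not hit:
            if len(self.cache) >= self.capacity:
                self.cache.remove(min(self.cache, key=self._hazard))
            self.cache.add(obj)
        return hit

c = HazardRateCache(capacity=2)
print([c.request(x) for x in "ababcab"])   # hit/miss pattern under hazard-rate eviction
```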
arXiv Detail & Related papers (2024-02-05T08:06:03Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
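A toy sketch of the two levers: an exact-match query cache in front of inference and a multiplexer that routes uncached queries to a small or a large model. The routing heuristic and model stubs below are placeholders for the learned multiplexer studied in the paper.

```python
# Exact-match query cache + toy small/large model multiplexer.

def small_model(q: str) -> str:
    return f"small-answer({q})"

def large_model(q: str) -> str:
    return f"large-answer({q})"

CACHE: dict[str, str] = {}

def predicted_small_model_quality(q: str) -> float:
    # placeholder: pretend short queries are easy for the small model
    return 1.0 if len(q.split()) <= 6 else 0.3

def answer(q: str, quality_threshold: float = 0.8) -> str:
    if q in CACHE:                       # cache hit: no inference cost
        return CACHE[q]
    if predicted_small_model_quality(q) >= quality_threshold:
        out = small_model(q)             # cheap model is good enough
    else:
        out = large_model(q)             # fall back to the large model
    CACHE[q] = out
    return out

print(answer("capital of France?"))      # routed to the small model
print(answer("capital of France?"))      # served from the cache
```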
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Optimistic No-regret Algorithms for Discrete Caching [6.182368229968862]
We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning.
We provide a universal lower bound for prediction-assisted online caching and design a suite of policies with a range of performance-complexity trade-offs.
Our results substantially improve upon all recently-proposed online caching policies, which, being unable to exploit the oracle predictions, offer only $O(\sqrt{T})$ regret.
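To illustrate how a prediction can enter the caching decision (a toy stand-in, not the paper's policies), the step below adds the oracle's predicted next request to the observed request counts before caching the top-$C$ files.

```python
import numpy as np

# Toy prediction-assisted ("optimistic") caching step: bias the observed
# counts by the predicted next request, then cache the top-C files.

def optimistic_cache(counts: np.ndarray, predicted_next: int, capacity: int):
    score = counts.copy()
    score[predicted_next] += 1.0          # trust the oracle for one step
    return set(np.argpartition(score, -capacity)[-capacity:])

counts = np.array([5.0, 3.0, 3.0, 2.9])
print(optimistic_cache(counts, predicted_next=3, capacity=2))
# the prediction tips file 3 into the cache alongside file 0
```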
arXiv Detail & Related papers (2022-08-15T09:18:41Z) - Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
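A minimal sketch of approximate-key lookup: a query counts as a hit if its feature vector lies within a distance threshold of some cached key, at the price of an approximation error. The brute-force search and fixed threshold stand in for the paper's indexing and error-control analysis.

```python
import numpy as np

# Approximate-key cache: nearest cached key within a distance threshold.

class ApproxKeyCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.keys: list[np.ndarray] = []
        self.vals: list[int] = []

    def get(self, x: np.ndarray):
        if not self.keys:
            return None
        d = np.linalg.norm(np.stack(self.keys) - x, axis=1)
        i = int(np.argmin(d))
        return self.vals[i] if d[i] <= self.threshold else None

    def put(self, x: np.ndarray, label: int):
        self.keys.append(x)
        self.vals.append(label)

cache = ApproxKeyCache(threshold=0.1)
cache.put(np.array([0.00, 1.00]), label=7)
print(cache.get(np.array([0.02, 0.99])))   # approximate hit -> 7
print(cache.get(np.array([0.90, 0.10])))   # too far -> None (miss)
```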
arXiv Detail & Related papers (2021-12-13T13:49:11Z) - Online Caching with Optimal Switching Regret [10.447270433913134]
A cache of limited storage capacity can hold $C$ files at a time from a large catalog.
In the case of a cache hit, the policy receives a unit reward, and zero reward otherwise.
The objective is to design a caching policy that incurs minimal regret while considering both the rewards due to cache-hits and the switching cost due to the file fetches.
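The regret notion can be made concrete with a small computation: total hit reward minus per-fetch switching cost, compared against the best fixed cache configuration in hindsight. The per-fetch cost and the brute-force baseline below are illustrative assumptions for toy problem sizes.

```python
from itertools import combinations

# Switching regret for a single cache of capacity C: unit reward per hit,
# a cost per fetched file, benchmarked against the best static configuration.

def evaluate(requests, configs, fetch_cost=1.0):
    """Total reward of a sequence of cache configurations on a request trace."""
    reward, prev = 0.0, set()
    for cfg, req in zip(configs, requests):
        reward -= fetch_cost * len(cfg - prev)     # pay for new fetches
        reward += 1.0 if req in cfg else 0.0       # unit reward on a hit
        prev = cfg
    return reward

def best_static_reward(requests, catalog, capacity, fetch_cost=1.0):
    best = float("-inf")
    for cfg in combinations(catalog, capacity):    # brute force, toy sizes only
        best = max(best, evaluate(requests, [set(cfg)] * len(requests), fetch_cost))
    return best

requests = ["a", "b", "a", "c", "a", "b"]
policy = [set(), {"a"}, {"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}]
policy_reward = evaluate(requests, policy)
regret = best_static_reward(requests, catalog="abc", capacity=2) - policy_reward
print(policy_reward, regret)
```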
arXiv Detail & Related papers (2021-01-18T12:47:22Z) - Quantum Differentially Private Sparse Regression Learning [132.1981461292324]
We devise an efficient quantum differentially private (QDP) Lasso estimator to solve sparse regression tasks.
Last, we exhibit that the QDP Lasso attains a near-optimal utility bound $\tilde{O}(N^{-2/3})$ with privacy guarantees.
arXiv Detail & Related papers (2020-07-23T10:50:42Z) - Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z) - PA-Cache: Evolving Learning-Based Popularity-Aware Content Caching in Edge Networks [14.939950326112045]
We propose PA-Cache, an evolving learning-based content caching policy for edge networks.
It adaptively learns time-varying content popularity and determines which contents should be replaced when the cache is full.
We extensively evaluate the performance of our proposed PA-Cache on real-world traces from a large online video-on-demand service provider.
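As a simplified stand-in for PA-Cache's evolving deep-learning popularity model, the sketch below tracks an exponentially weighted popularity estimate per content and replaces the least popular item when the cache is full.

```python
from collections import defaultdict

# Popularity-aware eviction: EWMA popularity estimate per content,
# evict the least popular cached item on a miss when full.

class PopularityCache:
    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity, self.decay = capacity, decay
        self.cache = set()
        self.pop = defaultdict(float)     # EWMA popularity per content

    def request(self, item) -> bool:
        for k in self.pop:                # decay all estimates
            self.pop[k] *= self.decay
        self.pop[item] += 1.0             # bump the requested content
        if item in self.cache:
            return True
        if len(self.cache) >= self.capacity:
            self.cache.remove(min(self.cache, key=self.pop.__getitem__))
        self.cache.add(item)
        return False

c = PopularityCache(capacity=2)
print([c.request(x) for x in "aababcab"])   # hit/miss pattern under popularity eviction
```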
arXiv Detail & Related papers (2020-02-20T15:38:01Z)