No-Regret Caching via Online Mirror Descent
- URL: http://arxiv.org/abs/2101.12588v5
- Date: Tue, 6 Jun 2023 14:06:09 GMT
- Title: No-Regret Caching via Online Mirror Descent
- Authors: T. Si Salem, G. Neglia and S. Ioannidis
- Abstract summary: We study an online caching problem in which requests can be served by a local cache to avoid retrieval costs from a remote server.
We show that bounds for the regret crucially depend on the diversity of the request process, captured by the diversity ratio R/h.
We also prove that, when the cache must store the entire file, rather than a fraction, OMD strategies can be coupled with a randomized rounding scheme that preserves regret guarantees.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study an online caching problem in which requests can be served by a local
cache to avoid retrieval costs from a remote server. The cache can update its
state after a batch of requests and store an arbitrarily small fraction of each
file. We study no-regret algorithms based on Online Mirror Descent (OMD)
strategies. We show that bounds for the regret crucially depend on the
diversity of the request process, captured by the diversity ratio R/h, where R
is the size of the batch, and h is the maximum multiplicity of a request in a
given batch. We characterize the optimality of OMD caching policies w.r.t.
regret under different diversity regimes. We also prove that, when the cache
must store the entire file, rather than a fraction, OMD strategies can be
coupled with a randomized rounding scheme that preserves regret guarantees,
even when update costs cannot be neglected. We provide a formal
characterization of the rounding problem through optimal transport theory, and
moreover we propose a computationally efficient randomized rounding scheme.
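To make the OMD recipe concrete, below is a minimal sketch of the fractional-caching update with the Euclidean mirror map, i.e., online gradient ascent followed by projection onto the capacity-capped simplex. The paper analyzes a family of mirror maps and tunes the step size via the diversity ratio R/h; the function names, the bisection-based projection, and the fixed step size `eta` here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def project_capped_simplex(y, k, iters=60):
    # Euclidean projection onto {x : 0 <= x <= 1, sum(x) = k}: bisect on the
    # threshold tau such that sum(clip(y - tau, 0, 1)) = k (monotone in tau).
    lo, hi = y.min() - 1.0, y.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(y - tau, 0.0, 1.0).sum() > k:
            lo = tau
        else:
            hi = tau
    return np.clip(y - 0.5 * (lo + hi), 0.0, 1.0)

def omd_caching(batches, n_files, k, eta=0.1):
    # Euclidean-OMD (online gradient ascent) on batched requests:
    # `batches` yields length-n_files request-count vectors; x is the
    # fractional cache state (x_i = fraction of file i stored locally).
    x = np.full(n_files, k / n_files)        # uniform feasible start
    for counts in batches:
        served = counts @ x                  # requests served from the cache
        x = project_capped_simplex(x + eta * counts, k)
        yield x, served
```

Swapping the Euclidean mirror map for negative entropy turns this additive update into a multiplicative one; the optimality characterization in the abstract concerns which geometry is preferable in each diversity regime.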
Related papers
- On the Regret of Coded Caching with Adversarial Requests [7.171698704686835]
We study the well-known coded caching problem in an online learning framework, wherein requests arrive sequentially.
We introduce a caching policy based on the Follow-The-Perturbed-Leader principle and show that, for any time horizon T, it achieves a sub-linear regret of $\mathcal{O}(\sqrt{T})$.
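For intuition on the FTPL principle, here is a bare whole-file caching sketch: perturb cumulative request counts with a one-shot Gaussian draw and cache the k highest-scoring files. The `sqrt(t)` noise schedule and the hit-count bookkeeping are common illustrative choices and an assumption here, not the paper's coded-caching algorithm.

```python
import numpy as np

def ftpl_cache(request_stream, n_files, k, seed=0):
    # Follow-The-Perturbed-Leader for whole-file caching: each round, cache
    # the k files whose perturbed cumulative counts are highest.
    rng = np.random.default_rng(seed)
    gamma = rng.standard_normal(n_files)   # sampled once, reused every round
    counts = np.zeros(n_files)
    hits = 0
    for t, f in enumerate(request_stream, start=1):
        noise_scale = np.sqrt(t)           # one standard schedule (assumption)
        cache = np.argpartition(counts + noise_scale * gamma, -k)[-k:]
        hits += int(f in cache)            # count the hit, then learn
        counts[f] += 1
    return hits
```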
arXiv Detail & Related papers (2024-09-19T01:13:03Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant-sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - No-Regret Caching with Noisy Request Estimates [12.603423174002254]
We propose the Noisy-Follow-the-Perturbed-Leader (NFPL) algorithm, a variant of the classic Follow-The-Perturbed-Leader (FPL) algorithm for settings where request estimates are noisy.
We show that the proposed solution has sublinear regret under specific conditions on the request estimator.
arXiv Detail & Related papers (2023-09-05T08:57:35Z) - Unrolled Compressed Blind-Deconvolution [77.88847247301682]
Sparse multichannel blind deconvolution (S-MBD) arises frequently in many engineering applications such as radar/sonar/ultrasound imaging.
We propose a compression method that enables blind recovery from far fewer measurements than the full received signal in time.
arXiv Detail & Related papers (2022-09-28T15:16:58Z) - Optimistic No-regret Algorithms for Discrete Caching [6.182368229968862]
We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning.
We provide a universal lower bound for prediction-assisted online caching and design a suite of policies with a range of performance-complexity trade-offs.
Our results substantially improve upon all recently proposed online caching policies, which, being unable to exploit the oracle predictions, offer only $O(\sqrt{T})$ regret.
arXiv Detail & Related papers (2022-08-15T09:18:41Z) - Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
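The general idea can be illustrated with a toy approximate-key LRU: quantize the input features into a coarse key so that nearby inputs hit the same entry, trading approximation error for hit rate. The class below is a hypothetical sketch, not the paper's system; the grid-quantization key is an assumption standing in for its similarity-based lookup.

```python
from collections import OrderedDict
import numpy as np

class ApproxKeyLRU:
    # Approximate-key cache: inputs map to a coarse key by quantization, so
    # nearby feature vectors share a cached prediction. A coarser grid raises
    # the hit rate but also the approximation error.

    def __init__(self, capacity, grid=0.5):
        self.capacity, self.grid = capacity, grid
        self.store = OrderedDict()

    def _key(self, features):
        return tuple(np.round(np.asarray(features) / self.grid).astype(int))

    def get(self, features):
        key = self._key(features)
        if key in self.store:
            self.store.move_to_end(key)    # refresh LRU position
            return self.store[key]
        return None                        # approximate miss

    def put(self, features, prediction):
        key = self._key(features)
        self.store[key] = prediction
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False) # evict least-recently used entry
```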
arXiv Detail & Related papers (2021-12-13T13:49:11Z) - ARCH: Efficient Adversarial Regularized Training with Caching [91.74682538906691]
Adversarial regularization can improve model generalization in many natural language processing tasks.
We propose a new adversarial regularization method ARCH, where perturbations are generated and cached once every several epochs.
We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks.
arXiv Detail & Related papers (2021-09-15T02:05:37Z) - Cache Replacement as a MAB with Delayed Feedback and Decaying Costs [4.358626952482686]
We propose and solve a new variant of the well-known multi-armed bandit (MAB) problem.
Each arm represents a distinct cache replacement policy, which advises on the page to evict from the cache when needed.
We introduce an adaptive reinforcement learning algorithm EXP4-DFDC that provides a solution to the problem.
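As a simplified illustration of treating replacement policies as arms, here is a bare exponential-weights (EXP3-style) selector; EXP4-DFDC additionally handles expert advice, delayed feedback, and decaying costs, all of which this sketch omits. The `reward_fn` callback and all names are hypothetical.

```python
import numpy as np

def policy_selector(policies, rounds, reward_fn, eta=0.1, seed=0):
    # Exponential-weights bandit over cache replacement policies (the arms).
    # reward_fn(i, t) returns the observed hit/miss reward in [0, 1] of
    # running policy i at round t (assumed immediate, unlike the paper).
    rng = np.random.default_rng(seed)
    w = np.ones(len(policies))
    for t in range(rounds):
        p = w / w.sum()
        i = rng.choice(len(policies), p=p)
        r = reward_fn(i, t)
        w[i] *= np.exp(eta * r / p[i])     # importance-weighted update
    return policies[int(np.argmax(w))]
```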
arXiv Detail & Related papers (2020-09-23T18:26:48Z) - A Non-Stationary Bandit-Learning Approach to Energy-Efficient Femto-Caching with Rateless-Coded Transmission [98.47527781626161]
We study a resource allocation problem for joint caching and transmission in small cell networks.
We then formulate the problem as selecting a file from the cache together with a transmission power level for every broadcast round.
In contrast to the state-of-the-art research, the proposed approach is especially suitable for networks with time-variant statistical properties.
arXiv Detail & Related papers (2020-04-13T09:07:17Z)