Related papers: LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference

URL: http://arxiv.org/abs/2509.15515v1
Date: Fri, 19 Sep 2025 01:39:08 GMT
Title: LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference
Authors: Hantao Yang, Hong Xie, Defu Lian, Enhong Chen
Abstract summary: We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to balance computational overhead and cache updates. We prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ in Berkeley. We also provide a problem-dependent bound, which was absent in previous works.
Score: 87.57291812372848
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper revisits the LLM cache bandit problem, with a special focus on addressing query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In our theoretical analysis, we prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. We also provide a problem-dependent bound, which was absent in previous works. Experiments on real-world data show that our algorithm reduces the total cost by approximately 12%.
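To make the knapsack framing in the abstract concrete, here is a minimal sketch of cache selection as a 0/1 knapsack solved by dynamic programming. It is not the paper's algorithm: the bandit estimation and the accumulation-based update rule are not shown, and the function name `select_cache`, the integer query sizes, and the per-query "value" (assumed here to be an already-estimated expected cost saving) are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def select_cache(
    sizes: List[int], values: List[float], capacity: int
) -> Tuple[float, List[int]]:
    """Pick the subset of queries to cache that maximizes total value
    while the summed sizes stay within `capacity` (0/1 knapsack DP).

    Assumption: values[i] is an estimate of the cost saved by caching
    query i; how that estimate is learned is outside this sketch.
    """
    n = len(sizes)
    # best[i][c] = max total value using the first i queries with capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, value = sizes[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip query i-1
            if size <= c:  # option 2: cache query i-1 if it fits
                candidate = best[i - 1][c - size] + value
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Backtrack to recover which queries were selected.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Four queries with heterogeneous sizes and a cache of size 7.
    total, chosen = select_cache([3, 4, 2, 5], [6.0, 7.0, 3.0, 9.0], 7)
    print(total, chosen)
```

With uniform query sizes the selection would reduce to keeping the highest-value queries; heterogeneous sizes are what make the combinatorial search necessary. The DP runs in time proportional to the number of queries times the capacity, which gives one plausible reason a method would want to limit how often the selection is recomputed.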

Related papers

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation [54.61034867177997]
Caching inference responses allows them to be retrieved without another forward pass through the Large Language Models. Traditional exact-match caching overlooks the semantic similarity between queries, leading to unnecessary recomputation. We present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions.
arXiv Detail & Related papers (2025-08-11T06:53:27Z)
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [2.3587921104010756]
We propose BUZZ, a novel KV caching algorithm to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA.
arXiv Detail & Related papers (2024-10-30T14:53:37Z)
LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts. LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
MeanCache: User-Centric Semantic Caching for LLM Web Services [8.350378532274405]
Caching is a natural solution to reduce inference costs on repeated queries. This paper introduces MeanCache, a user-centric semantic cache for LLM-based services. MeanCache identifies semantically similar queries to determine cache hit or miss.
arXiv Detail & Related papers (2024-03-05T06:23:50Z)
Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch. Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z)
JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost. We present JoinGym, a query optimization environment for bushy reinforcement learning (RL). Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction [0.709016563801433]
MUSTACHE is a new page cache replacement algorithm whose logic is learned from observed memory access requests rather than fixed like existing policies. We formulate the page request prediction problem as a categorical time series forecasting task. Our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal Bélády's replacement algorithm.
arXiv Detail & Related papers (2022-11-03T23:10:21Z)
How to Query An Oracle? Efficient Strategies to Label Data [59.89900843097016]
We consider the basic problem of querying an expert oracle for labeling a dataset in machine learning. We present a randomized batch algorithm that operates on a round-by-round basis to label the samples and achieves a query rate of $O(\frac{N}{k^2})$. In addition, we present an adaptive greedy query scheme, which achieves an average rate of $\approx 0.2N$ queries per sample with triplet queries.
arXiv Detail & Related papers (2021-10-05T20:15:35Z)
Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle [7.449644976563424]
We propose an elegant theoretical model for studying clustering with a faulty oracle. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters. We provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible.
arXiv Detail & Related papers (2021-06-18T22:20:12Z)
Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering. A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities. We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O(\frac{n^3}{Q})$.
arXiv Detail & Related papers (2020-02-26T15:18:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.