Related papers: MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction

MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction

URL: http://arxiv.org/abs/2211.02177v1
Date: Thu, 3 Nov 2022 23:10:21 GMT
Title: MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction
Authors: Gabriele Tolomei and Lorenzo Takanen and Fabio Pinelli
Abstract summary: MUSTACHE is a new page cache replacement whose logic is learned from observed memory access requests rather than fixed like existing policies. We formulate the page request prediction problem as a categorical time series forecasting task. Our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal B'el'ady's replacement algorithm.
Score: 0.709016563801433
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work, we propose MUSTACHE, a new page cache replacement algorithm whose logic is learned from observed memory access requests rather than fixed like existing policies. We formulate the page request prediction problem as a categorical time series forecasting task. Then, our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal B\'el\'ady's replacement algorithm. We implement several forecasting techniques using advanced deep learning architectures and integrate the best-performing one into an existing open-source cache simulator. Experiments run on benchmark datasets show that MUSTACHE outperforms the best page replacement heuristic (i.e., exact LRU), improving the cache hit ratio by 1.9% and reducing the number of reads/writes required to handle cache misses by 18.4% and 10.3%.

Related papers

MagCache: Fast Video Generation with Magnitude-Aware Cache [91.51242917160373]
We introduce a novel and robust discovery: a unified magnitude law observed across different models and prompts.<n>We introduce a Magnitude-aware Cache (MagCache) that adaptively skips unimportant timesteps using an error modeling mechanism and adaptive caching strategy.<n> Experimental results show that MagCache achieves 2.1x and 2.68x speedups on Open-Sora and Wan 2.1, respectively.
arXiv Detail & Related papers (2025-06-10T17:59:02Z)
Efficient and Optimal No-Regret Caching under Partial Observation [11.537072761243344]
We study the caching problem in a more restrictive setting where only a fraction of past requests are observed. We propose a randomized caching policy with sublinear regret based on classic online learning algorithm Follow-the-Perturbed-Leader.
arXiv Detail & Related papers (2025-03-04T16:21:33Z)
vCache: Verified Semantic Prompt Caching [75.87215136638828]
This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees.<n>It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training.<n>Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models. We propose an importance-driven cache merging strategy to prune redundancy caches. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somehow surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters. We introduce a novel scheme, named Learningto-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-r, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
Efficient Prompt Caching via Embedding Similarity [26.456212783693545]
We focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity. We propose a distillation-based method to fine-tune the existing embeddings for better better prediction. We also conduct simulations demonstrating that our trained models achieve better caching efficiency than the previous embedding model.
arXiv Detail & Related papers (2024-02-02T06:34:11Z)
TransforMAP: Transformer for Memory Access Prediction [10.128730975303407]
Data Prefetching is a technique that can hide memory latency by fetching data before it is needed by a program. We develop TransforMAP, based on the powerful Transformer model, that can learn from the whole address space. We show that our approach achieves 35.67% MPKI improvement, higher than state-of-the-art prefetcher and ISB prefetcher.
arXiv Detail & Related papers (2022-05-29T22:14:38Z)
Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching. While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error. We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
ARCH: Efficient Adversarial Regularized Training with Caching [91.74682538906691]
Adversarial regularization can improve model generalization in many natural language processing tasks. We propose a new adversarial regularization method ARCH, where perturbations are generated and cached once every several epochs. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks.
arXiv Detail & Related papers (2021-09-15T02:05:37Z)
IRLI: Iterative Re-partitioning for Learning to Index [104.72641345738425]
Methods have to trade between obtaining high accuracy while maintaining load balance and scalability in distributed settings. We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data. We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
Learning Forward Reuse Distance [1.8777512961936749]
Recent advancement of deep learning techniques enables the design of novel intelligent cache replacement policies. We find that a powerful LSTM-based recurrent neural network model can provide high prediction accuracy based on only a cache trace as input. Results demonstrate that the new cache policy improves state-of-art practical policies by up to 19.2% and incurs only 2.3% higher miss ratio than OPT on average.
arXiv Detail & Related papers (2020-07-31T05:57:50Z)
Change Rate Estimation and Optimal Freshness in Web Page Crawling [2.4923006485141284]
finite bandwidth availability and server restrictions impose some constraints on the crawling frequency. The ideal crawling rates are the ones that maximise the freshness of the local cache. We provide two novel schemes for online estimation of page change rates.
arXiv Detail & Related papers (2020-04-05T11:48:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.