Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching
- URL: http://arxiv.org/abs/2112.06671v1
- Date: Mon, 13 Dec 2021 13:49:11 GMT
- Title: Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching
- Authors: Alessandro Finamore, James Roberts, Massimo Gallo, Dario Rossi
- Abstract summary: We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model the caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our approach with state-of-the-art similarity caching.
- Score: 72.50506500576746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Deep Learning (DL) technologies are a promising tool to solve
networking problems that map to classification tasks, their computational
complexity is still too high with respect to real-time traffic measurement
requirements. To reduce the DL inference cost, we propose a novel caching
paradigm that we name approximate-key caching, which returns approximate
results for lookups of selected inputs based on cached DL inference results.
While approximate cache hits alleviate the DL inference workload and increase
system throughput, they introduce an approximation error. As such, we couple
approximate-key caching with a principled error-correction algorithm that we
name auto-refresh. We analytically model our caching system's performance for
classic LRU and ideal caches, perform a trace-driven evaluation of the expected
performance, and compare the benefits of our approach with state-of-the-art
similarity caching, attesting to the practical interest of our proposal.
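To make the paradigm concrete, below is a minimal Python sketch of approximate-key caching with an LRU policy and an auto-refresh-style correction. It assumes a hashable input and a callable `model`; the coarsening via a truncated hash, the `refresh_period` counter, and all parameter names are illustrative rather than the paper's actual algorithm.

```python
from collections import OrderedDict

class ApproximateKeyCache:
    """Minimal sketch: an LRU cache keyed on a coarsened ("approximate") version
    of the input, plus a periodic refresh that re-runs the model so that stale
    or colliding entries cannot serve wrong labels indefinitely."""

    def __init__(self, model, capacity=1024, key_bits=16, refresh_period=100):
        self.model = model                  # callable: input -> class label
        self.capacity = capacity
        self.key_bits = key_bits            # how aggressively inputs are coarsened
        self.refresh_period = refresh_period
        self.cache = OrderedDict()          # approx_key -> (label, hit_count)

    def approx_key(self, x):
        # Coarsen the input (here: a hash truncated to key_bits). Distinct inputs
        # may collide on purpose; that is what makes hits "approximate".
        return hash(x) & ((1 << self.key_bits) - 1)

    def classify(self, x):
        k = self.approx_key(x)
        if k in self.cache:
            label, hits = self.cache.pop(k)
            hits += 1
            # Auto-refresh-style correction: periodically re-run full inference.
            if hits % self.refresh_period == 0:
                label = self.model(x)
            self.cache[k] = (label, hits)   # re-insert as most recently used
            return label
        label = self.model(x)               # cache miss: full DL inference
        self.cache[k] = (label, 1)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return label
```

In this sketch, decreasing `key_bits` raises the hit ratio but also the collision (approximation) rate, which is the trade-off the auto-refresh mechanism is meant to keep in check.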
Related papers
- HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration [18.170285241800798]
We propose a method that harmonizes training and inference through a learning-based caching framework.
Compared to the traditional training paradigm, the newly proposed Step-Wise Denoising Training (SDT) maintains the continuity of the denoising process.
The Image Error Proxy-Guided Objective (IEPO) integrates an efficient proxy mechanism to approximate the final image error caused by reusing cached features.
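As a toy illustration of the general feature-reuse idea (not HarmoniCa's SDT/IEPO training procedure), the sketch below reuses a cached block output at selected denoising steps instead of recomputing it, with `use_cache_at` standing in for a learned reuse schedule.

```python
import numpy as np

def denoise_with_feature_cache(x, num_steps, block, use_cache_at):
    """Illustrative sketch: at selected steps, reuse the feature cached at an
    earlier step instead of recomputing the transformer block."""
    cached_feature = None
    for t in range(num_steps):
        if t in use_cache_at and cached_feature is not None:
            feature = cached_feature          # approximate: skip the block
        else:
            feature = block(x, t)             # full computation
            cached_feature = feature
        x = x + 0.1 * feature                 # stand-in for the denoising update
    return x

# Toy usage: a random "block" and a fixed reuse schedule.
rng = np.random.default_rng(0)
block = lambda x, t: np.tanh(x + 0.01 * t)
out = denoise_with_feature_cache(rng.standard_normal(8), num_steps=20,
                                 block=block, use_cache_at={5, 6, 12, 13})
```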
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of cached entries.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
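A hedged sketch of the importance-driven merging idea (not Elastic Cache's exact rule): keep the most frequently attended key/value entries and fold each pruned entry into its most similar kept key by averaging. The similarity measure and the `budget` parameter are assumptions made for illustration.

```python
import numpy as np

def merge_kv_cache(keys, values, freq, budget):
    """Keep the `budget` most important entries (by frequency score) and merge
    each pruned entry into its nearest kept key instead of discarding it."""
    keep = np.argsort(freq)[-budget:]                 # most important entries
    drop = np.setdiff1d(np.arange(len(freq)), keep)
    new_k, new_v = keys[keep].copy(), values[keep].copy()
    for i in drop:
        sims = new_k @ keys[i]                        # dot-product similarity
        j = int(np.argmax(sims))                      # nearest kept entry
        new_k[j] = (new_k[j] + keys[i]) / 2           # merge instead of discard
        new_v[j] = (new_v[j] + values[i]) / 2
    return new_k, new_v

# Toy usage: 16 cached entries of dimension 8, pruned to a budget of 8.
keys, values = np.random.randn(16, 8), np.random.randn(16, 8)
freq = np.random.rand(16)                             # frequency-based importance
k2, v2 = merge_kv_cache(keys, values, freq, budget=8)
```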
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
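A very small sketch of the reward-adjustment idea, where a digital-twin estimate of caching reliability shifts the RL reward whenever a constraint would be violated; the reliability estimate, threshold, and penalty weight are illustrative assumptions, not D-REC's formulation.

```python
def reliability_shaped_reward(base_reward, predicted_reliability,
                              threshold=0.95, penalty=10.0):
    """Penalize actions whose predicted reliability falls below the constraint."""
    violation = max(0.0, threshold - predicted_reliability)
    return base_reward - penalty * violation

# e.g. an action predicted to drop reliability to 0.90 is penalized:
r = reliability_shaped_reward(base_reward=1.0, predicted_reliability=0.90)
```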
arXiv Detail & Related papers (2024-06-29T02:40:28Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods, at the same inference speed.
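A toy sketch of the layer-caching mechanism: per layer, a score decides whether to reuse the output cached at the previous denoising step or to recompute it. Here the learned router is replaced by a random `reuse_score` and a fixed `threshold`, which are illustrative assumptions rather than L2C's learned scheme.

```python
import numpy as np

def forward_with_layer_cache(x, layers, reuse_score, threshold, cache):
    """Reuse a layer's cached output from the previous step when its reuse
    score is high enough; otherwise recompute and refresh the cache."""
    for i, layer in enumerate(layers):
        if reuse_score[i] > threshold and i in cache:
            x = cache[i]                 # reuse: layer computation skipped
        else:
            x = layer(x)                 # recompute and refresh the cache
            cache[i] = x
    return x

# Toy usage: 12 dummy layers shared across 10 successive denoising steps.
layers = [lambda x, w=w: np.tanh(x + w) for w in np.linspace(0, 1, 12)]
cache, x = {}, np.zeros(4)
for step in range(10):
    x = forward_with_layer_cache(x, layers, reuse_score=np.random.rand(12),
                                 threshold=0.5, cache=cache)
```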
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLM chat services.
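A minimal sketch of semantic-key lookup: queries are embedded and a new query reuses a stored answer when it is similar enough to a cached one. The embedding model, the linear scan, and the similarity threshold are placeholders; SCALM's actual cache organization and pattern analysis are not reproduced here.

```python
import numpy as np

class SemanticCache:
    """Embedding-similarity cache: reuse an answer when cosine similarity to a
    cached query exceeds a threshold, otherwise call the LLM."""

    def __init__(self, embed, llm, threshold=0.9):
        # `embed` is assumed to return a NumPy vector; `llm` maps text -> answer.
        self.embed, self.llm, self.threshold = embed, llm, threshold
        self.entries = []                      # list of (embedding, answer)

    def query(self, text):
        q = self.embed(text)
        q = q / (np.linalg.norm(q) + 1e-9)
        for e, answer in self.entries:
            if float(q @ e) >= self.threshold: # semantically similar enough
                return answer                  # cache hit: skip the LLM call
        answer = self.llm(text)
        self.entries.append((q, answer))
        return answer
```

At scale, the linear scan would naturally be replaced by an approximate nearest-neighbor index.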
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method reduces inference cost while maintaining the same accuracy.
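In spirit, the per-instance switching can be sketched as below; the `cheap_model`, `full_model`, confidence estimate, and threshold `tau` are illustrative stand-ins for the learned switching decision.

```python
def switchable_generate(x, cheap_model, full_model, confidence, tau=0.8):
    """Route easy inputs to the cheap path; uncertain ones fall back to the
    full model, spending compute only where it is needed."""
    y, conf = cheap_model(x), confidence(x)
    return y if conf >= tau else full_model(x)
```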
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- Cache-Aware Reinforcement Learning in Large-Scale Recommender Systems [10.52021139266752]
We propose a cache-aware reinforcement learning (CARL) method to jointly optimize the recommendation produced by real-time computation and by the cache.
CARL can significantly improve users' engagement when considering the result cache.
CARL has been fully launched in the Kwai app, serving over 100 million users.
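A toy, single-state tabular sketch of the cache-aware decision: the action is whether to serve the cached recommendation or recompute it in real time, and the reward couples engagement with compute cost. The payoff and cost numbers and the Q-update are illustrative assumptions, not CARL's architecture.

```python
import random

def cache_aware_step(q, state, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One epsilon-greedy step of a toy Q-learning agent choosing between
    serving the cache and recomputing (single-state simplification)."""
    actions = ("serve_cache", "recompute")
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: q.get((state, act), 0.0))
    engagement = 1.0 if a == "recompute" else 0.8   # assumed engagement gap
    cost = 0.3 if a == "recompute" else 0.0         # assumed compute cost
    reward = engagement - cost
    best_next = max(q.get((state, act), 0.0) for act in actions)
    old = q.get((state, a), 0.0)
    q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
    return a, reward
```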
arXiv Detail & Related papers (2024-04-23T12:06:40Z)
- No-Regret Caching with Noisy Request Estimates [12.603423174002254]
We propose the Noisy-Follow-the-Perturbed-Leader (NFPL) algorithm, a variant of the classic Follow-the-Perturbed-Leader (FPL) when request estimates are noisy.
We show that the proposed solution has sublinear regret under specific conditions on the request estimator.
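The core of a Follow-the-Perturbed-Leader cache with noisy estimates can be sketched in a few lines: perturb the cumulative (noisy) request counts and cache the top-k items. The Gaussian perturbation scale and the toy request model are assumptions for illustration.

```python
import numpy as np

def nfpl_cache(request_estimates, cache_size, sigma=1.0, rng=None):
    """Add a fresh perturbation to the cumulative noisy request estimates and
    cache the top-k items (Follow-the-Perturbed-Leader style)."""
    rng = rng or np.random.default_rng()
    perturbed = request_estimates + sigma * rng.standard_normal(len(request_estimates))
    return set(np.argsort(perturbed)[-cache_size:])    # indices of cached items

# Toy usage over a stream of noisy per-item popularity estimates.
est = np.zeros(50)
for t in range(1, 101):
    noisy_counts = np.random.poisson(lam=np.linspace(5, 0.1, 50))  # noisy requests
    est += noisy_counts
    cached = nfpl_cache(est, cache_size=10)
```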
arXiv Detail & Related papers (2023-09-05T08:57:35Z)
- CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization [61.71504948770445]
We propose a novel channel pruning method via Class-Aware Trace Ratio Optimization (CATRO) to reduce the computational burden and accelerate the model inference.
We show that CATRO achieves higher accuracy with similar cost or lower cost with similar accuracy than other state-of-the-art channel pruning algorithms.
Because of its class-aware property, CATRO is suitable for adaptively pruning efficient networks for various classification subtasks, facilitating the deployment and use of deep networks in real-world applications.
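A simplified sketch of the class-aware scoring idea: a per-channel Fisher-style ratio of between-class to within-class variance, standing in for CATRO's subset-level trace-ratio optimization. The toy data below is random and only demonstrates the computation.

```python
import numpy as np

def class_aware_channel_scores(features, labels):
    """Score each channel by between-class vs. within-class variance of its
    activations; higher scores indicate more class-discriminative channels."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        fc = features[labels == c]
        between += len(fc) * (fc.mean(axis=0) - overall_mean) ** 2
        within += ((fc - fc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-9)

# Keep the highest-scoring channels (toy data: 256 samples, 64 channels).
feats = np.random.randn(256, 64)
labs = np.random.randint(0, 10, size=256)
keep = np.argsort(class_aware_channel_scores(feats, labs))[-32:]
```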
arXiv Detail & Related papers (2021-10-21T06:26:31Z)
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction-serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
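The learned-cache idea can be sketched as an intermediate-layer predictor with an early exit: a small model attached to an intermediate activation predicts the final label, and the remaining layers run only when it is not confident. The split point, the predictor, and the threshold `tau` are illustrative assumptions, not GATI's actual design.

```python
def infer_with_learned_cache(x, backbone_front, learned_cache, backbone_rest, tau=0.9):
    """Early-exit via a learned cache: return the cheap prediction when it is
    confident enough, otherwise finish full inference.
    `learned_cache(h)` is assumed to return a NumPy probability vector."""
    h = backbone_front(x)                       # run the first part of the DNN
    probs = learned_cache(h)                    # cheap learned predictor
    if probs.max() >= tau:
        return int(probs.argmax())              # "cache hit": early exit
    return int(backbone_rest(h).argmax())       # miss: finish full inference
```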
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.