Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching
- URL: http://arxiv.org/abs/2112.06671v1
- Date: Mon, 13 Dec 2021 13:49:11 GMT
- Title: Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching
- Authors: Alessandro Finamore, James Roberts, Massimo Gallo, Dario Rossi
- Abstract summary: We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model the caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our approach with state-of-the-art similarity caching.
- Score: 72.50506500576746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Deep Learning (DL) technologies are a promising tool to solve
networking problems that map to classification tasks, their computational
complexity is still too high with respect to real-time traffic measurement
requirements. To reduce the DL inference cost, we propose a novel caching
paradigm that we name approximate-key caching, which returns approximate
results for lookups of selected inputs based on cached DL inference results.
While approximate cache hits alleviate the DL inference workload and increase
system throughput, they introduce an approximation error. As such, we couple
approximate-key caching with a principled error-correction algorithm that we
name auto-refresh. We analytically model our caching system's performance for
classic LRU and ideal caches, perform a trace-driven evaluation of the expected
performance, and compare the benefits of our approach with state-of-the-art
similarity caching, attesting to the practical interest of our proposal.
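To make the paradigm concrete, below is a minimal Python sketch of approximate-key caching with an LRU policy and an auto-refresh-style correction. It assumes a hashable input and a callable `model`; the coarsening via a truncated hash, the `refresh_period` counter, and all parameter names are illustrative rather than the paper's actual algorithm.

```python
from collections import OrderedDict

class ApproximateKeyCache:
    """Minimal sketch: an LRU cache keyed on a coarsened ("approximate") version
    of the input, plus a periodic refresh that re-runs the model so that stale
    or colliding entries cannot serve wrong labels indefinitely."""

    def __init__(self, model, capacity=1024, key_bits=16, refresh_period=100):
        self.model = model                  # callable: input -> class label
        self.capacity = capacity
        self.key_bits = key_bits            # how aggressively inputs are coarsened
        self.refresh_period = refresh_period
        self.cache = OrderedDict()          # approx_key -> (label, hit_count)

    def approx_key(self, x):
        # Coarsen the input (here: a hash truncated to key_bits). Distinct inputs
        # may collide on purpose; that is what makes hits "approximate".
        return hash(x) & ((1 << self.key_bits) - 1)

    def classify(self, x):
        k = self.approx_key(x)
        if k in self.cache:
            label, hits = self.cache.pop(k)
            hits += 1
            # Auto-refresh-style correction: periodically re-run full inference.
            if hits % self.refresh_period == 0:
                label = self.model(x)
            self.cache[k] = (label, hits)   # re-insert as most recently used
            return label
        label = self.model(x)               # cache miss: full DL inference
        self.cache[k] = (label, 1)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return label
```

In this sketch, decreasing `key_bits` raises the hit ratio but also the collision (approximation) rate, which is the trade-off the auto-refresh mechanism is meant to keep in check.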
Related papers
- HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration [18.170285241800798]
We propose a method that harmonizes training and inference through a learning-based caching framework.
Compared to the traditional training paradigm, the newly proposed Step-Wise Denoising Training (SDT) maintains the continuity of the denoising process.
The Image Error Proxy-Guided Objective (IEPO) integrates an efficient proxy mechanism to approximate the final image error caused by reusing cached features.
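As a toy illustration of the general feature-reuse idea (not HarmoniCa's SDT/IEPO training procedure), the sketch below reuses a cached block output at selected denoising steps instead of recomputing it, with `use_cache_at` standing in for a learned reuse schedule.

```python
import numpy as np

def denoise_with_feature_cache(x, num_steps, block, use_cache_at):
    """Illustrative sketch: at selected steps, reuse the feature cached at an
    earlier step instead of recomputing the transformer block."""
    cached_feature = None
    for t in range(num_steps):
        if t in use_cache_at and cached_feature is not None:
            feature = cached_feature          # approximate: skip the block
        else:
            feature = block(x, t)             # full computation
            cached_feature = feature
        x = x + 0.1 * feature                 # stand-in for the denoising update
    return x

# Toy usage: a random "block" and a fixed reuse schedule.
rng = np.random.default_rng(0)
block = lambda x, t: np.tanh(x + 0.01 * t)
out = denoise_with_feature_cache(rng.standard_normal(8), num_steps=20,
                                 block=block, use_cache_at={5, 6, 12, 13})
```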
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of cached entries.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
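A hedged sketch of the importance-driven merging idea (not Elastic Cache's exact rule): keep the most frequently attended key/value entries and fold each pruned entry into its most similar kept key by averaging. The similarity measure and the `budget` parameter are assumptions made for illustration.

```python
import numpy as np

def merge_kv_cache(keys, values, freq, budget):
    """Keep the `budget` most important entries (by frequency score) and merge
    each pruned entry into its nearest kept key instead of discarding it."""
    keep = np.argsort(freq)[-budget:]                 # most important entries
    drop = np.setdiff1d(np.arange(len(freq)), keep)
    new_k, new_v = keys[keep].copy(), values[keep].copy()
    for i in drop:
        sims = new_k @ keys[i]                        # dot-product similarity
        j = int(np.argmax(sims))                      # nearest kept entry
        new_k[j] = (new_k[j] + keys[i]) / 2           # merge instead of discard
        new_v[j] = (new_v[j] + values[i]) / 2
    return new_k, new_v

# Toy usage: 16 cached entries of dimension 8, pruned to a budget of 8.
keys, values = np.random.randn(16, 8), np.random.randn(16, 8)
freq = np.random.rand(16)                             # frequency-based importance
k2, v2 = merge_kv_cache(keys, values, freq, budget=8)
```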
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Digital Twin-Assisted Data-Driven Optimization for Reliable Edge Caching in Wireless Networks [60.54852710216738]
We introduce a novel digital twin-assisted optimization framework, called D-REC, to ensure reliable caching in nextG wireless networks.
By incorporating reliability modules into a constrained decision process, D-REC can adaptively adjust actions, rewards, and states to comply with advantageous constraints.
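A very small sketch of the reward-adjustment idea, where a digital-twin estimate of caching reliability shifts the RL reward whenever a constraint would be violated; the reliability estimate, threshold, and penalty weight are illustrative assumptions, not D-REC's formulation.

```python
def reliability_shaped_reward(base_reward, predicted_reliability,
                              threshold=0.95, penalty=10.0):
    """Penalize actions whose predicted reliability falls below the constraint."""
    violation = max(0.0, threshold - predicted_reliability)
    return base_reward - penalty * violation

# e.g. an action predicted to drop reliability to 0.90 is penalized:
r = reliability_shaped_reward(base_reward=1.0, predicted_reliability=0.90)
```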
arXiv Detail & Related papers (2024-06-29T02:40:28Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods, at the same inference speed.
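A toy sketch of the layer-caching mechanism: per layer, a score decides whether to reuse the output cached at the previous denoising step or to recompute it. Here the learned router is replaced by a random `reuse_score` and a fixed `threshold`, which are illustrative assumptions rather than L2C's learned scheme.

```python
import numpy as np

def forward_with_layer_cache(x, layers, reuse_score, threshold, cache):
    """Reuse a layer's cached output from the previous step when its reuse
    score is high enough; otherwise recompute and refresh the cache."""
    for i, layer in enumerate(layers):
        if reuse_score[i] > threshold and i in cache:
            x = cache[i]                 # reuse: layer computation skipped
        else:
            x = layer(x)                 # recompute and refresh the cache
            cache[i] = x
    return x

# Toy usage: 12 dummy layers shared across 10 successive denoising steps.
layers = [lambda x, w=w: np.tanh(x + w) for w in np.linspace(0, 1, 12)]
cache, x = {}, np.zeros(4)
for step in range(10):
    x = forward_with_layer_cache(x, layers, reuse_score=np.random.rand(12),
                                 threshold=0.5, cache=cache)
```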
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models [15.742472622602557]
We propose SCALM, a new cache architecture that emphasizes semantic analysis and identifies significant cache entries and patterns.
Our evaluations show that SCALM increases cache hit ratios and reduces operational costs for LLM chat services.
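A minimal sketch of semantic-key lookup: queries are embedded and a new query reuses a stored answer when it is similar enough to a cached one. The embedding model, the linear scan, and the similarity threshold are placeholders; SCALM's actual cache organization and pattern analysis are not reproduced here.

```python
import numpy as np

class SemanticCache:
    """Embedding-similarity cache: reuse an answer when cosine similarity to a
    cached query exceeds a threshold, otherwise call the LLM."""

    def __init__(self, embed, llm, threshold=0.9):
        # `embed` is assumed to return a NumPy vector; `llm` maps text -> answer.
        self.embed, self.llm, self.threshold = embed, llm, threshold
        self.entries = []                      # list of (embedding, answer)

    def query(self, text):
        q = self.embed(text)
        q = q / (np.linalg.norm(q) + 1e-9)
        for e, answer in self.entries:
            if float(q @ e) >= self.threshold: # semantically similar enough
                return answer                  # cache hit: skip the LLM call
        answer = self.llm(text)
        self.entries.append((q, answer))
        return answer
```

At scale, the linear scan would naturally be replaced by an approximate nearest-neighbor index.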
arXiv Detail & Related papers (2024-05-24T08:16:22Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method reduces inference cost while maintaining the same accuracy.
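In spirit, the per-instance switching can be sketched as below; the `cheap_model`, `full_model`, confidence estimate, and threshold `tau` are illustrative stand-ins for the learned switching decision.

```python
def switchable_generate(x, cheap_model, full_model, confidence, tau=0.8):
    """Route easy inputs to the cheap path; uncertain ones fall back to the
    full model, spending compute only where it is needed."""
    y, conf = cheap_model(x), confidence(x)
    return y if conf >= tau else full_model(x)
```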
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- Cache-Aware Reinforcement Learning in Large-Scale Recommender Systems [10.52021139266752]
We propose a cache-aware reinforcement learning (CARL) method to jointly optimize the recommendation produced by real-time computation and by the cache.
CARL can significantly improve users' engagement when considering the result cache.
CARL has been fully launched in the Kwai app, serving over 100 million users.
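A toy, single-state tabular sketch of the cache-aware decision: the action is whether to serve the cached recommendation or recompute it in real time, and the reward couples engagement with compute cost. The payoff and cost numbers and the Q-update are illustrative assumptions, not CARL's architecture.

```python
import random

def cache_aware_step(q, state, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One epsilon-greedy step of a toy Q-learning agent choosing between
    serving the cache and recomputing (single-state simplification)."""
    actions = ("serve_cache", "recompute")
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: q.get((state, act), 0.0))
    engagement = 1.0 if a == "recompute" else 0.8   # assumed engagement gap
    cost = 0.3 if a == "recompute" else 0.0         # assumed compute cost
    reward = engagement - cost
    best_next = max(q.get((state, act), 0.0) for act in actions)
    old = q.get((state, a), 0.0)
    q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
    return a, reward
```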
arXiv Detail & Related papers (2024-04-23T12:06:40Z)
- No-Regret Caching with Noisy Request Estimates [12.603423174002254]
We propose the Noisy-Follow-the-Perturbed-Leader (NFPL) algorithm, a variant of the classic Follow-the-Perturbed-Leader (FPL) when request estimates are noisy.
We show that the proposed solution has sublinear regret under specific conditions on the request estimator.
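The core of a Follow-the-Perturbed-Leader cache with noisy estimates can be sketched in a few lines: perturb the cumulative (noisy) request counts and cache the top-k items. The Gaussian perturbation scale and the toy request model are assumptions for illustration.

```python
import numpy as np

def nfpl_cache(request_estimates, cache_size, sigma=1.0, rng=None):
    """Add a fresh perturbation to the cumulative noisy request estimates and
    cache the top-k items (Follow-the-Perturbed-Leader style)."""
    rng = rng or np.random.default_rng()
    perturbed = request_estimates + sigma * rng.standard_normal(len(request_estimates))
    return set(np.argsort(perturbed)[-cache_size:])    # indices of cached items

# Toy usage over a stream of noisy per-item popularity estimates.
est = np.zeros(50)
for t in range(1, 101):
    noisy_counts = np.random.poisson(lam=np.linspace(5, 0.1, 50))  # noisy requests
    est += noisy_counts
    cached = nfpl_cache(est, cache_size=10)
```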
arXiv Detail & Related papers (2023-09-05T08:57:35Z)
- CATRO: Channel Pruning via Class-Aware Trace Ratio Optimization [61.71504948770445]
We propose a novel channel pruning method via Class-Aware Trace Ratio Optimization (CATRO) to reduce the computational burden and accelerate the model inference.
We show that CATRO achieves higher accuracy with similar cost or lower cost with similar accuracy than other state-of-the-art channel pruning algorithms.
Because of its class-aware property, CATRO is suitable for adaptively pruning efficient networks for various classification subtasks, facilitating the deployment and use of deep networks in real-world applications.
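A simplified sketch of the class-aware scoring idea: a per-channel Fisher-style ratio of between-class to within-class variance, standing in for CATRO's subset-level trace-ratio optimization. The toy data below is random and only demonstrates the computation.

```python
import numpy as np

def class_aware_channel_scores(features, labels):
    """Score each channel by between-class vs. within-class variance of its
    activations; higher scores indicate more class-discriminative channels."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        fc = features[labels == c]
        between += len(fc) * (fc.mean(axis=0) - overall_mean) ** 2
        within += ((fc - fc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-9)

# Keep the highest-scoring channels (toy data: 256 samples, 64 channels).
feats = np.random.randn(256, 64)
labs = np.random.randint(0, 10, size=256)
keep = np.argsort(class_aware_channel_scores(feats, labs))[-32:]
```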
arXiv Detail & Related papers (2021-10-21T06:26:31Z)
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction-serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
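The learned-cache idea can be sketched as an intermediate-layer predictor with an early exit: a small model attached to an intermediate activation predicts the final label, and the remaining layers run only when it is not confident. The split point, the predictor, and the threshold `tau` are illustrative assumptions, not GATI's actual design.

```python
def infer_with_learned_cache(x, backbone_front, learned_cache, backbone_rest, tau=0.9):
    """Early-exit via a learned cache: return the cheap prediction when it is
    confident enough, otherwise finish full inference.
    `learned_cache(h)` is assumed to return a NumPy probability vector."""
    h = backbone_front(x)                       # run the first part of the DNN
    probs = learned_cache(h)                    # cheap learned predictor
    if probs.max() >= tau:
        return int(probs.argmax())              # "cache hit": early exit
    return int(backbone_rest(h).argmax())       # miss: finish full inference
```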
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.