Accelerating Deep Learning Inference via Learned Caches
- URL: http://arxiv.org/abs/2101.07344v1
- Date: Mon, 18 Jan 2021 22:13:08 GMT
- Title: Accelerating Deep Learning Inference via Learned Caches
- Authors: Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram
Venkataraman, Aditya Akella
- Abstract summary: Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
- Score: 11.617579969991294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Neural Networks (DNNs) are witnessing increased adoption in multiple
domains owing to their high accuracy in solving real-world problems. However,
this high accuracy has been achieved by building deeper networks, posing a
fundamental challenge to the low-latency inference desired by user-facing
applications. Current low-latency solutions trade off accuracy or fail to
exploit the inherent temporal locality in prediction serving workloads.
We observe that caching hidden layer outputs of the DNN can introduce a form
of late-binding where inference requests only consume the amount of computation
needed. This enables a mechanism for achieving low latencies, coupled with an
ability to exploit temporal locality. However, traditional caching approaches
incur high memory overheads and lookup latencies, leading us to design learned
caches - caches that consist of simple ML models that are continuously updated.
We present the design of GATI, an end-to-end prediction serving system that
incorporates learned caches for low-latency DNN inference. Results show that
GATI can reduce inference latency by up to 7.69X on realistic workloads.
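To make the learned-cache idea concrete, the following is a minimal sketch, not GATI's actual implementation: a small predictor attached to one intermediate layer maps the hidden activation directly to a final label and lets a request exit early only when the predictor is sufficiently confident. The module names, the single cache location, and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch of a "learned cache": a cheap model attached to an
# intermediate layer predicts the final output so confident requests can
# exit early. Names and the threshold are assumptions, not GATI's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedCacheHead(nn.Module):
    """Small predictor trained to mimic the full model's output from a
    hidden-layer activation; in the paper's setting it would be
    continuously retrained as the request distribution shifts."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.fc(h)

class CachedInferenceModel(nn.Module):
    def __init__(self, blocks: nn.ModuleList, cache_head: LearnedCacheHead,
                 final_head: nn.Module, cache_at: int, threshold: float = 0.9):
        super().__init__()
        self.blocks = blocks          # backbone layers of the base DNN
        self.cache_head = cache_head  # learned cache at layer `cache_at`
        self.final_head = final_head  # original classifier head
        self.cache_at = cache_at
        self.threshold = threshold    # confidence required for a "cache hit"

    def forward(self, x: torch.Tensor):
        # Assumes a single request (batch of one) per call, as in serving.
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.cache_at:
                logits = self.cache_head(x)
                conf, pred = F.softmax(logits, dim=-1).max(dim=-1)
                if conf.item() >= self.threshold:
                    # Cache hit: late binding lets the request stop here.
                    return pred, "cache_hit"
        # Cache miss: fall through to the rest of the network.
        logits = self.final_head(x)
        return logits.argmax(dim=-1), "full_model"
```

A deployment along the paper's lines would attach such heads at several depths, keep them simple so that lookup latency and memory overhead stay low, and continuously update them on recent full-model outputs to track the workload's temporal locality.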
Related papers
- QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models [2.6663666678221376]
Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency limit real-world applicability.
We introduce QIANets, a novel approach that redesigns the traditional GoogLeNet, DenseNet, and ResNet-18 architectures to process more parameters and computations while maintaining low inference times.
Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times while effectively preserving accuracy.
arXiv Detail & Related papers (2024-10-14T09:24:48Z)
- Accelerating Scalable Graph Neural Network Inference with Node-Adaptive Propagation [80.227864832092]
Graph neural networks (GNNs) have exhibited exceptional efficacy in a diverse array of applications.
The sheer size of large-scale graphs presents a significant challenge to real-time inference with GNNs.
We propose an online propagation framework and two novel node-adaptive propagation methods.
arXiv Detail & Related papers (2023-10-17T05:03:00Z)
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
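As a rough illustration of the node-adaptive propagation idea from the two GNN entries above (the per-node rule below is an assumed heuristic, not either paper's method), each node can be given its own propagation depth derived from local topology:

```python
# Toy node-adaptive propagation: each node gets its own propagation depth
# based on its degree (an assumed heuristic for illustration only).
import numpy as np

def adaptive_propagate(adj: np.ndarray, features: np.ndarray, max_steps: int = 4):
    deg = adj.sum(axis=1)
    # Per-node step budget: shallower for high-degree nodes (assumption).
    steps = np.clip(max_steps - np.log1p(deg).astype(int), 1, max_steps)
    norm_adj = adj / np.maximum(deg[:, None], 1.0)
    out = features.copy()
    propagated = features.copy()
    for k in range(1, max_steps + 1):
        propagated = norm_adj @ propagated
        active = steps >= k              # only nodes whose budget allows hop k
        out[active] = out[active] + propagated[active]
    return out / (steps[:, None] + 1.0)  # average over the hops each node used

# Tiny example graph: node 0 connected to nodes 1 and 2.
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
print(adaptive_propagate(adj, np.eye(3)))
```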
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
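A hypothetical sketch of exit-aware preemptive serving in the spirit of the Fluid Batching entry above: inference is executed in stages that end at early-exit points, and the scheduler re-picks the most urgent request after every stage, so newly arrived requests can preempt running ones at exit boundaries. The earliest-deadline priority rule, the stage granularity, and the toy exit condition are all assumptions.

```python
# Hypothetical exit-aware preemptive serving: work is scheduled in stages
# that end at early-exit points, so new arrivals can preempt a running
# request at the next exit boundary.
import heapq
import itertools

class Request:
    def __init__(self, req_id, deadline, num_stages):
        self.req_id = req_id
        self.deadline = deadline
        self.next_stage = 0
        self.num_stages = num_stages

def run_stage(request):
    """Placeholder for executing one backbone segment up to the next
    early-exit head on the NPU. Returns True if the request exits."""
    request.next_stage += 1
    return request.next_stage >= request.num_stages  # toy exit condition

def serve(requests):
    counter = itertools.count()           # tie-breaker for equal deadlines
    queue = [(r.deadline, next(counter), r) for r in requests]
    heapq.heapify(queue)
    while queue:
        deadline, _, req = heapq.heappop(queue)
        exited = run_stage(req)           # run exactly one stage, then yield
        if exited:
            print(f"request {req.req_id} finished at stage {req.next_stage}")
        else:
            # Not done: requeue so a more urgent arrival can run first.
            heapq.heappush(queue, (deadline, next(counter), req))

serve([Request("a", deadline=5, num_stages=3),
       Request("b", deadline=2, num_stages=2)])
```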
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models can still be significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
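The approximate-key caching entry above can be sketched as follows (assumed details, not the paper's system): inputs are quantized into a coarse key so that sufficiently similar requests map to the same cache entry and reuse an earlier prediction, trading a bounded approximation error for higher throughput.

```python
# Illustrative approximate-key cache: nearby inputs collapse to one quantized
# key, so "close enough" requests reuse a previous classification. The
# feature quantization step and LRU capacity are assumptions.
from collections import OrderedDict
import numpy as np

class ApproximateKeyCache:
    def __init__(self, quant_step=0.25, capacity=10_000):
        self.quant_step = quant_step      # coarser step => more hits, more error
        self.capacity = capacity
        self.store = OrderedDict()        # LRU map: key -> cached prediction

    def _key(self, features: np.ndarray) -> bytes:
        # Quantize the feature vector; similar inputs share the same key.
        return np.round(features / self.quant_step).astype(np.int32).tobytes()

    def lookup(self, features):
        key = self._key(features)
        if key in self.store:
            self.store.move_to_end(key)   # refresh LRU position
            return self.store[key]        # approximate hit (may carry error)
        return None

    def insert(self, features, prediction):
        key = self._key(features)
        self.store[key] = prediction
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

def classify(features, cache, run_dnn):
    cached = cache.lookup(features)
    if cached is not None:
        return cached                     # skip the DNN entirely
    prediction = run_dnn(features)
    cache.insert(features, prediction)
    return prediction
```

Coarser quantization raises the hit rate but also the approximation error, which is the trade-off the paper models analytically.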
- Learning from Images: Proactive Caching with Parallel Convolutional Neural Networks [94.85780721466816]
A novel framework for proactive caching is proposed in this paper.
It combines model-based optimization with data-driven techniques by transforming an optimization problem into a grayscale image.
Numerical results show that the proposed scheme can reduce computation time by 71.6% with only a 0.8% additional performance cost.
arXiv Detail & Related papers (2021-08-15T21:32:47Z)
- CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge [3.398008512297358]
CacheNet is a model caching framework for machine perception applications.
It caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.
It is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone.
arXiv Detail & Related papers (2020-07-03T16:32:14Z)
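A hedged sketch of the CacheNet-style split described above: a cached low-complexity model answers on the end device when it is confident enough, and the request otherwise falls back to the full model on an edge or cloud server. The confidence threshold and the stand-in models are placeholders, not values from the paper.

```python
# Two-tier serving sketch: small cached model on the device, full model on
# the edge/cloud server as fallback. Threshold and models are placeholders.
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed tuning knob, not a value from the paper

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def device_infer(x, small_model):
    """Run the cached low-complexity model on the end device."""
    probs = softmax(small_model(x))
    return int(probs.argmax()), float(probs.max())

def serve_request(x, small_model, query_edge_server):
    label, confidence = device_infer(x, small_model)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "device"                 # fast path: no network hop
    return query_edge_server(x), "edge"        # slow path: full model remotely

# Toy usage with stand-in models
small_model = lambda x: x @ np.random.randn(8, 4)        # tiny linear "model"
edge_server = lambda x: int((x @ np.random.randn(8, 4)).argmax())
print(serve_request(np.random.randn(8), small_model, edge_server))
```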
- Accelerating Deep Learning Inference via Freezing [8.521443408415868]
We present Freeze Inference, a system that introduces approximate caching at each intermediate layer.
We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18.
arXiv Detail & Related papers (2020-02-07T07:03:58Z)
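A rough sketch of the approximate intermediate-layer caching described in the Freeze Inference entry above; the cosine-similarity test and the threshold are assumptions for illustration. Previously seen hidden-layer activations are stored with the labels the full model produced for them, and a new request stops early when its activation is close enough to a stored one.

```python
# Approximate cache over intermediate activations: reuse an earlier label
# when a new request's hidden-layer vector is similar enough to a stored one.
import numpy as np

class IntermediateLayerCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # cosine similarity needed to reuse a label
        self.activations = []        # stored hidden-layer vectors
        self.labels = []             # labels the full model produced for them

    def query(self, activation: np.ndarray):
        if not self.activations:
            return None
        bank = np.stack(self.activations)
        sims = bank @ activation / (
            np.linalg.norm(bank, axis=1) * np.linalg.norm(activation) + 1e-12)
        best = int(sims.argmax())
        if sims[best] >= self.threshold:
            return self.labels[best]   # approximate hit: skip remaining layers
        return None

    def add(self, activation, label):
        self.activations.append(activation)
        self.labels.append(label)
```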
- An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices [58.62801151916888]
We introduce a new sparsity dimension, pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly.
Our pattern-based sparsity naturally fits into compiler optimizations for highly efficient DNN execution on mobile platforms.
arXiv Detail & Related papers (2020-01-20T16:17:36Z)
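To illustrate the pattern-based sparsity entry above with a toy example (the specific patterns and the pruning ratio are assumptions, not the paper's design): each 3x3 kernel keeps only the weights selected by its best-fitting predefined pattern, and whole low-magnitude kernels are zeroed out for connectivity sparsity.

```python
# Toy pattern-based sparsity: keep 4 weights per 3x3 kernel according to a
# small set of predefined masks, plus whole-kernel (connectivity) pruning.
import numpy as np

# Hand-picked 4-entry patterns over a flattened 3x3 kernel (indices 0..8);
# these are illustrative, not the patterns used in the paper.
PATTERNS = [
    np.array([1, 3, 4, 5]),   # plus-shaped pattern around the center
    np.array([0, 2, 4, 6]),   # corner-heavy pattern including the center
    np.array([2, 4, 5, 8]),   # right-leaning pattern
]

def apply_pattern_sparsity(kernel: np.ndarray) -> np.ndarray:
    """Keep only the 4 weights selected by the best-fitting pattern."""
    flat = kernel.reshape(-1)
    # Choose the pattern that preserves the most weight magnitude.
    best = max(PATTERNS, key=lambda p: np.abs(flat[p]).sum())
    mask = np.zeros(9)
    mask[best] = 1.0
    return (flat * mask).reshape(3, 3)

def prune_connectivity(kernels: np.ndarray, keep_ratio=0.75) -> np.ndarray:
    """Drop whole kernels with the smallest norms (connectivity sparsity)."""
    norms = np.abs(kernels).sum(axis=(1, 2))
    cutoff = np.quantile(norms, 1.0 - keep_ratio)
    kept = kernels.copy()
    kept[norms < cutoff] = 0.0
    return kept

# Usage on a random bank of 3x3 kernels
kernels = prune_connectivity(np.random.randn(16, 3, 3))
sparse = np.stack([apply_pattern_sparsity(k) for k in kernels])
```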
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.