Accelerating Deep Learning Inference via Learned Caches
- URL: http://arxiv.org/abs/2101.07344v1
- Date: Mon, 18 Jan 2021 22:13:08 GMT
- Title: Accelerating Deep Learning Inference via Learned Caches
- Authors: Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram
Venkataraman, Aditya Akella
- Abstract summary: Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
- Score: 11.617579969991294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Neural Networks (DNNs) are witnessing increased adoption in multiple
domains owing to their high accuracy in solving real-world problems. However,
this high accuracy has been achieved by building deeper networks, posing a
fundamental challenge to the low-latency inference desired by user-facing
applications. Current low-latency solutions trade off accuracy or fail to
exploit the inherent temporal locality in prediction serving workloads.
We observe that caching hidden layer outputs of the DNN can introduce a form
of late-binding where inference requests only consume the amount of computation
needed. This enables a mechanism for achieving low latencies, coupled with an
ability to exploit temporal locality. However, traditional caching approaches
incur high memory overheads and lookup latencies, leading us to design learned
caches - caches that consist of simple ML models that are continuously updated.
We present the design of GATI, an end-to-end prediction serving system that
incorporates learned caches for low-latency DNN inference. Results show that
GATI can reduce inference latency by up to 7.69X on realistic workloads.
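To make the learned-cache idea concrete, the following is a minimal sketch, not GATI's actual implementation: a small predictor attached to one intermediate layer maps the hidden activation directly to a final label and lets a request exit early only when the predictor is sufficiently confident. The module names, the single cache location, and the 0.9 threshold are illustrative assumptions.

```python
# Minimal sketch of a "learned cache": a cheap model attached to an
# intermediate layer predicts the final output so confident requests can
# exit early. Names and the threshold are assumptions, not GATI's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedCacheHead(nn.Module):
    """Small predictor trained to mimic the full model's output from a
    hidden-layer activation; in the paper's setting it would be
    continuously retrained as the request distribution shifts."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.fc(h)

class CachedInferenceModel(nn.Module):
    def __init__(self, blocks: nn.ModuleList, cache_head: LearnedCacheHead,
                 final_head: nn.Module, cache_at: int, threshold: float = 0.9):
        super().__init__()
        self.blocks = blocks          # backbone layers of the base DNN
        self.cache_head = cache_head  # learned cache at layer `cache_at`
        self.final_head = final_head  # original classifier head
        self.cache_at = cache_at
        self.threshold = threshold    # confidence required for a "cache hit"

    def forward(self, x: torch.Tensor):
        # Assumes a single request (batch of one) per call, as in serving.
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.cache_at:
                logits = self.cache_head(x)
                conf, pred = F.softmax(logits, dim=-1).max(dim=-1)
                if conf.item() >= self.threshold:
                    # Cache hit: late binding lets the request stop here.
                    return pred, "cache_hit"
        # Cache miss: fall through to the rest of the network.
        logits = self.final_head(x)
        return logits.argmax(dim=-1), "full_model"
```

A deployment along the paper's lines would attach such heads at several depths, keep them simple so that lookup latency and memory overhead stay low, and continuously update them on recent full-model outputs to track the workload's temporal locality.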
Related papers
- QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models [2.6663666678221376]
Convolutional neural networks (CNNs) have made significant advances in computer vision tasks, yet their high inference times and latency limit real-world applicability.
We introduce QIANets, a novel approach that redesigns the traditional GoogLeNet, DenseNet, and ResNet-18 architectures to process more parameters and computations while maintaining low inference times.
Despite experimental limitations, the method was tested and evaluated, demonstrating reductions in inference times while effectively preserving accuracy.
arXiv Detail & Related papers (2024-10-14T09:24:48Z)
- Accelerating Scalable Graph Neural Network Inference with Node-Adaptive Propagation [80.227864832092]
Graph neural networks (GNNs) have exhibited exceptional efficacy in a diverse array of applications.
The sheer size of large-scale graphs presents a significant challenge to real-time inference with GNNs.
We propose an online propagation framework and two novel node-adaptive propagation methods.
arXiv Detail & Related papers (2023-10-17T05:03:00Z)
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
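As a rough illustration of the node-adaptive propagation idea from the two GNN entries above (the per-node rule below is an assumed heuristic, not either paper's method), each node can be given its own propagation depth derived from local topology:

```python
# Toy node-adaptive propagation: each node gets its own propagation depth
# based on its degree (an assumed heuristic for illustration only).
import numpy as np

def adaptive_propagate(adj: np.ndarray, features: np.ndarray, max_steps: int = 4):
    deg = adj.sum(axis=1)
    # Per-node step budget: shallower for high-degree nodes (assumption).
    steps = np.clip(max_steps - np.log1p(deg).astype(int), 1, max_steps)
    norm_adj = adj / np.maximum(deg[:, None], 1.0)
    out = features.copy()
    propagated = features.copy()
    for k in range(1, max_steps + 1):
        propagated = norm_adj @ propagated
        active = steps >= k              # only nodes whose budget allows hop k
        out[active] = out[active] + propagated[active]
    return out / (steps[:, None] + 1.0)  # average over the hops each node used

# Tiny example graph: node 0 connected to nodes 1 and 2.
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
print(adaptive_propagate(adj, np.eye(3)))
```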
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
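A hypothetical sketch of exit-aware preemptive serving in the spirit of the Fluid Batching entry above: inference is executed in stages that end at early-exit points, and the scheduler re-picks the most urgent request after every stage, so newly arrived requests can preempt running ones at exit boundaries. The earliest-deadline priority rule, the stage granularity, and the toy exit condition are all assumptions.

```python
# Hypothetical exit-aware preemptive serving: work is scheduled in stages
# that end at early-exit points, so new arrivals can preempt a running
# request at the next exit boundary.
import heapq
import itertools

class Request:
    def __init__(self, req_id, deadline, num_stages):
        self.req_id = req_id
        self.deadline = deadline
        self.next_stage = 0
        self.num_stages = num_stages

def run_stage(request):
    """Placeholder for executing one backbone segment up to the next
    early-exit head on the NPU. Returns True if the request exits."""
    request.next_stage += 1
    return request.next_stage >= request.num_stages  # toy exit condition

def serve(requests):
    counter = itertools.count()           # tie-breaker for equal deadlines
    queue = [(r.deadline, next(counter), r) for r in requests]
    heapq.heapify(queue)
    while queue:
        deadline, _, req = heapq.heappop(queue)
        exited = run_stage(req)           # run exactly one stage, then yield
        if exited:
            print(f"request {req.req_id} finished at stage {req.next_stage}")
        else:
            # Not done: requeue so a more urgent arrival can run first.
            heapq.heappush(queue, (deadline, next(counter), req))

serve([Request("a", deadline=5, num_stages=3),
       Request("b", deadline=2, num_stages=2)])
```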
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models can still be significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
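The approximate-key caching entry above can be sketched as follows (assumed details, not the paper's system): inputs are quantized into a coarse key so that sufficiently similar requests map to the same cache entry and reuse an earlier prediction, trading a bounded approximation error for higher throughput.

```python
# Illustrative approximate-key cache: nearby inputs collapse to one quantized
# key, so "close enough" requests reuse a previous classification. The
# feature quantization step and LRU capacity are assumptions.
from collections import OrderedDict
import numpy as np

class ApproximateKeyCache:
    def __init__(self, quant_step=0.25, capacity=10_000):
        self.quant_step = quant_step      # coarser step => more hits, more error
        self.capacity = capacity
        self.store = OrderedDict()        # LRU map: key -> cached prediction

    def _key(self, features: np.ndarray) -> bytes:
        # Quantize the feature vector; similar inputs share the same key.
        return np.round(features / self.quant_step).astype(np.int32).tobytes()

    def lookup(self, features):
        key = self._key(features)
        if key in self.store:
            self.store.move_to_end(key)   # refresh LRU position
            return self.store[key]        # approximate hit (may carry error)
        return None

    def insert(self, features, prediction):
        key = self._key(features)
        self.store[key] = prediction
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

def classify(features, cache, run_dnn):
    cached = cache.lookup(features)
    if cached is not None:
        return cached                     # skip the DNN entirely
    prediction = run_dnn(features)
    cache.insert(features, prediction)
    return prediction
```

Coarser quantization raises the hit rate but also the approximation error, which is the trade-off the paper models analytically.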
- Learning from Images: Proactive Caching with Parallel Convolutional Neural Networks [94.85780721466816]
A novel framework for proactive caching is proposed in this paper.
It combines model-based optimization with data-driven techniques by transforming an optimization problem into a grayscale image.
Numerical results show that the proposed scheme can reduce computation time by 71.6% with only a 0.8% additional performance cost.
arXiv Detail & Related papers (2021-08-15T21:32:47Z)
- CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge [3.398008512297358]
CacheNet is a model caching framework for machine perception applications.
It caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.
It is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone.
arXiv Detail & Related papers (2020-07-03T16:32:14Z)
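A hedged sketch of the CacheNet-style split described above: a cached low-complexity model answers on the end device when it is confident enough, and the request otherwise falls back to the full model on an edge or cloud server. The confidence threshold and the stand-in models are placeholders, not values from the paper.

```python
# Two-tier serving sketch: small cached model on the device, full model on
# the edge/cloud server as fallback. Threshold and models are placeholders.
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed tuning knob, not a value from the paper

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def device_infer(x, small_model):
    """Run the cached low-complexity model on the end device."""
    probs = softmax(small_model(x))
    return int(probs.argmax()), float(probs.max())

def serve_request(x, small_model, query_edge_server):
    label, confidence = device_infer(x, small_model)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "device"                 # fast path: no network hop
    return query_edge_server(x), "edge"        # slow path: full model remotely

# Toy usage with stand-in models
small_model = lambda x: x @ np.random.randn(8, 4)        # tiny linear "model"
edge_server = lambda x: int((x @ np.random.randn(8, 4)).argmax())
print(serve_request(np.random.randn(8), small_model, edge_server))
```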
- Accelerating Deep Learning Inference via Freezing [8.521443408415868]
We present Freeze Inference, a system that introduces approximate caching at each intermediate layer.
We find that this can potentially reduce the number of effective layers by half for 91.58% of CIFAR-10 requests run on ResNet-18.
arXiv Detail & Related papers (2020-02-07T07:03:58Z)
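A rough sketch of the approximate intermediate-layer caching described in the Freeze Inference entry above; the cosine-similarity test and the threshold are assumptions for illustration. Previously seen hidden-layer activations are stored with the labels the full model produced for them, and a new request stops early when its activation is close enough to a stored one.

```python
# Approximate cache over intermediate activations: reuse an earlier label
# when a new request's hidden-layer vector is similar enough to a stored one.
import numpy as np

class IntermediateLayerCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # cosine similarity needed to reuse a label
        self.activations = []        # stored hidden-layer vectors
        self.labels = []             # labels the full model produced for them

    def query(self, activation: np.ndarray):
        if not self.activations:
            return None
        bank = np.stack(self.activations)
        sims = bank @ activation / (
            np.linalg.norm(bank, axis=1) * np.linalg.norm(activation) + 1e-12)
        best = int(sims.argmax())
        if sims[best] >= self.threshold:
            return self.labels[best]   # approximate hit: skip remaining layers
        return None

    def add(self, activation, label):
        self.activations.append(activation)
        self.labels.append(label)
```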
- An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices [58.62801151916888]
We introduce a new sparsity dimension, pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly.
Our pattern-based sparsity naturally fits into compiler optimizations for highly efficient DNN execution on mobile platforms.
arXiv Detail & Related papers (2020-01-20T16:17:36Z)
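To illustrate the pattern-based sparsity entry above with a toy example (the specific patterns and the pruning ratio are assumptions, not the paper's design): each 3x3 kernel keeps only the weights selected by its best-fitting predefined pattern, and whole low-magnitude kernels are zeroed out for connectivity sparsity.

```python
# Toy pattern-based sparsity: keep 4 weights per 3x3 kernel according to a
# small set of predefined masks, plus whole-kernel (connectivity) pruning.
import numpy as np

# Hand-picked 4-entry patterns over a flattened 3x3 kernel (indices 0..8);
# these are illustrative, not the patterns used in the paper.
PATTERNS = [
    np.array([1, 3, 4, 5]),   # plus-shaped pattern around the center
    np.array([0, 2, 4, 6]),   # corner-heavy pattern including the center
    np.array([2, 4, 5, 8]),   # right-leaning pattern
]

def apply_pattern_sparsity(kernel: np.ndarray) -> np.ndarray:
    """Keep only the 4 weights selected by the best-fitting pattern."""
    flat = kernel.reshape(-1)
    # Choose the pattern that preserves the most weight magnitude.
    best = max(PATTERNS, key=lambda p: np.abs(flat[p]).sum())
    mask = np.zeros(9)
    mask[best] = 1.0
    return (flat * mask).reshape(3, 3)

def prune_connectivity(kernels: np.ndarray, keep_ratio=0.75) -> np.ndarray:
    """Drop whole kernels with the smallest norms (connectivity sparsity)."""
    norms = np.abs(kernels).sum(axis=(1, 2))
    cutoff = np.quantile(norms, 1.0 - keep_ratio)
    kept = kernels.copy()
    kept[norms < cutoff] = 0.0
    return kept

# Usage on a random bank of 3x3 kernels
kernels = prune_connectivity(np.random.randn(16, 3, 3))
sparse = np.stack([apply_pattern_sparsity(k) for k in kernels])
```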
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.