Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
- URL: http://arxiv.org/abs/2006.02464v2
- Date: Mon, 26 Oct 2020 15:52:20 GMT
- Title: Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
- Authors: Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann,
Ymir Vigfusson, Jonathan Mace
- Abstract summary: Machine learning inference is becoming a core building block for interactive web applications.
Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency.
We observe that inference using Deep Neural Network (DNN) models has deterministic performance.
- Score: 4.293235171619925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning inference is becoming a core building block for interactive
web applications. As a result, the underlying model serving systems on which
these applications depend must consistently meet low latency targets. Existing
model serving architectures use well-known reactive techniques to alleviate
common-case sources of latency, but cannot effectively curtail tail latency
caused by unpredictable execution times. Yet the underlying execution times are
not fundamentally unpredictable - on the contrary we observe that inference
using Deep Neural Network (DNN) models has deterministic performance. Here,
starting with the predictable execution times of individual DNN inferences, we
adopt a principled design methodology to successively build a fully distributed
model serving system that achieves predictable end-to-end performance. We
evaluate our implementation, Clockwork, using production trace workloads, and
show that Clockwork can support thousands of models while simultaneously
meeting 100ms latency targets for 99.9999% of requests. We further demonstrate
that Clockwork exploits predictable execution times to achieve tight
request-level service-level objectives (SLOs) as well as a high degree of
request-level performance isolation.
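To make the core idea concrete, the sketch below illustrates how predictable per-model execution times can drive proactive, SLO-aware scheduling: an earliest-deadline-first queue rejects any request whose predicted completion time would already exceed its deadline, rather than letting it inflate tail latency. This is a minimal illustrative sketch in Python under assumed names (PredictableScheduler, predicted_exec_ms, next_request); it is not Clockwork's actual implementation or API.

```python
import heapq
import time

class PredictableScheduler:
    """Toy earliest-deadline-first (EDF) scheduler that exploits
    predictable per-model execution times (illustrative only, not
    Clockwork's real scheduler)."""

    def __init__(self, predicted_exec_ms):
        # predicted_exec_ms: model_id -> predicted GPU execution time in ms,
        # e.g. profiled offline, relying on DNN inference being near-deterministic.
        self.predicted_exec_ms = predicted_exec_ms
        self.queue = []  # min-heap of (deadline_ms, request_id, model_id)

    def submit(self, request_id, model_id, slo_ms):
        # Deadline = arrival time + per-request latency SLO.
        deadline_ms = time.monotonic() * 1000.0 + slo_ms
        heapq.heappush(self.queue, (deadline_ms, request_id, model_id))

    def next_request(self):
        """Return the next request that can still meet its SLO, plus any
        requests rejected proactively because their predicted completion
        time already exceeds their deadline."""
        now_ms = time.monotonic() * 1000.0
        rejected = []
        while self.queue:
            deadline_ms, request_id, model_id = heapq.heappop(self.queue)
            if now_ms + self.predicted_exec_ms[model_id] <= deadline_ms:
                return request_id, model_id, rejected
            rejected.append(request_id)  # would miss its SLO; reject early
        return None, None, rejected

# Example usage (hypothetical model names and timings):
sched = PredictableScheduler({"resnet50": 20.0, "bert-base": 60.0})
sched.submit("r1", "resnet50", slo_ms=100.0)
sched.submit("r2", "bert-base", slo_ms=25.0)   # cannot finish within 25 ms
req, model, dropped = sched.next_request()
print(req, model, dropped)  # -> r1 resnet50 ['r2']: r2 is rejected proactively
```

Because execution times are predictable, the admit-or-reject decision can be made before work is wasted, which is the property the paper builds on to bound tail latency.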
Related papers
- Accelerate Intermittent Deep Inference [0.0]
Contemporary trends focus on making Deep Neural Network (DNN) models runnable on battery-less intermittent devices.
We propose Accelerated Intermittent Deep Inference to harness optimized inference models targeting footprints under 256KB and to make them schedulable and runnable under intermittent power.
arXiv Detail & Related papers (2024-07-01T20:15:16Z)
- Continuous time recurrent neural networks: overview and application to forecasting blood glucose in the intensive care unit [56.801856519460465]
Continuous time autoregressive recurrent neural networks (CTRNNs) are deep learning models that account for irregular observations.
We demonstrate the application of these models to probabilistic forecasting of blood glucose in a critical care setting.
arXiv Detail & Related papers (2023-04-14T09:39:06Z)
- Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning [11.007816552466952]
This paper focuses on the problem of scheduling inference queries on Deep Neural Networks in edge networks at short timescales.
Through simulations, we analyze several policies under realistic network settings and workloads of a large ISP.
We design ASET, a Reinforcement Learning-based scheduling algorithm that adapts its decisions to the system conditions.
arXiv Detail & Related papers (2023-01-31T13:23:34Z)
- Gated Recurrent Neural Networks with Weighted Time-Delay Feedback [59.125047512495456]
We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism.
We show that $\tau$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures.
arXiv Detail & Related papers (2022-12-01T02:26:34Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Improving the Performance of DNN-based Software Services using Automated Layer Caching [3.804240190982695]
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services.
The computational complexity of such large models remains significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services.
arXiv Detail & Related papers (2022-09-18T18:21:20Z)
- EIGNN: Efficient Infinite-Depth Graph Neural Networks [51.97361378423152]
Graph neural networks (GNNs) are widely used for modelling graph-structured data in numerous applications.
Motivated by the limited ability of finite-depth GNNs to capture long-range dependencies, we propose a GNN model with infinite depth, which we call Efficient Infinite-Depth Graph Neural Networks (EIGNN).
We show that EIGNN has a better ability to capture long-range dependencies than recent baselines, and consistently achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T08:16:58Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Accelerating Deep Learning Inference via Learned Caches [11.617579969991294]
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems.
Current low-latency solutions trade off accuracy or fail to exploit the inherent temporal locality in prediction serving workloads.
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency inference.
arXiv Detail & Related papers (2021-01-18T22:13:08Z)
- Generalized Latency Performance Estimation for Once-For-All Neural Architecture Search [0.0]
We introduce two generalizability strategies, including fine-tuning using a base model trained on a specific hardware platform and NAS search space.
We provide a family of latency prediction models that achieve over 50% lower RMSE loss compared to ProxylessNAS.
arXiv Detail & Related papers (2021-01-04T00:48:09Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)