Inference Latency Prediction at the Edge
- URL: http://arxiv.org/abs/2210.02620v1
- Date: Thu, 6 Oct 2022 00:46:06 GMT
- Title: Inference Latency Prediction at the Edge
- Authors: Zhuojin Li, Marco Paolieri and Leana Golubchik
- Abstract summary: State-of-the-art neural architectures (NAs) are typically designed through Neural Architecture Search (NAS) to identify NAs with good tradeoffs between accuracy and efficiency.
Since measuring the latency of a huge set of candidate architectures during NAS is not scalable, approaches are needed for predicting end-to-end inference latency on mobile devices.
We propose a latency prediction framework which addresses these challenges by developing operation-wise latency predictors.
- Score: 0.3974789827371669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing workload of inference tasks on mobile devices,
state-of-the-art neural architectures (NAs) are typically designed through
Neural Architecture Search (NAS) to identify NAs with good tradeoffs between
accuracy and efficiency (e.g., latency). Since measuring the latency of a huge
set of candidate architectures during NAS is not scalable, approaches are
needed for predicting end-to-end inference latency on mobile devices. Such
predictions are challenging due to hardware heterogeneity, optimizations
applied by ML frameworks, and the diversity of neural architectures. Motivated
by these challenges, in this paper, we first quantitatively assess
characteristics of neural architectures and mobile devices that have
significant effects on inference latency. Based on this assessment, we propose
a latency prediction framework which addresses these challenges by developing
operation-wise latency predictors under a variety of settings and on a number
of hardware devices with multi-core CPUs and GPUs; our comprehensive
evaluations show that this framework achieves high accuracy in end-to-end
latency prediction. To
illustrate that our approach does not require expensive data collection, we
also show that accurate predictions can be achieved on real-world NAs using
only small amounts of profiling data.
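To make the operation-wise idea concrete, here is a minimal sketch: per-operation regressors are trained on profiled latencies, and the end-to-end estimate is the sum of per-operation predictions. The feature choices (input size, output channels, kernel size) and the random-forest model are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of operation-wise latency prediction (illustrative; feature
# and model choices are assumptions, not the paper's exact setup).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class OperationWiseLatencyPredictor:
    def __init__(self):
        self.models = {}  # op_type -> fitted per-operation regressor

    def fit(self, profiles):
        # profiles: {op_type: (features [n, d], measured latencies [n])}
        for op_type, (X, y) in profiles.items():
            model = RandomForestRegressor(n_estimators=50, random_state=0)
            model.fit(X, y)
            self.models[op_type] = model

    def predict_end_to_end(self, ops):
        # ops: list of (op_type, feature vector); the end-to-end latency is
        # approximated as the sum of predicted per-operation latencies.
        return sum(float(self.models[t].predict(x.reshape(1, -1))[0])
                   for t, x in ops)

# Toy profiling data with hypothetical conv2d features:
# (input size, output channels, kernel size) -> latency in ms.
rng = np.random.default_rng(0)
X_conv = rng.uniform(1, 100, size=(200, 3))
y_conv = 1e-4 * X_conv[:, 0] * X_conv[:, 1] + rng.normal(0, 0.01, 200)
predictor = OperationWiseLatencyPredictor()
predictor.fit({"conv2d": (X_conv, y_conv)})
net = [("conv2d", np.array([32.0, 64.0, 3.0])),
       ("conv2d", np.array([16.0, 128.0, 3.0]))]
print(f"predicted end-to-end latency: {predictor.predict_end_to_end(net):.3f} ms")
```

Note that naively summing per-operation predictions ignores operator fusion and other framework-level optimizations, which the abstract identifies as one of the prediction challenges.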
Related papers
- On Latency Predictors for Neural Architecture Search [8.564763702766776]
We introduce a comprehensive suite of latency prediction tasks obtained in a principled way through automated partitioning of hardware device sets.
We then design a general latency predictor to comprehensively study (1) the predictor architecture, (2) NN sample selection methods, (3) hardware device representations, and (4) NN operation encoding schemes.
Building on conclusions from our study, we present an end-to-end latency predictor training strategy.
arXiv Detail & Related papers (2024-03-04T19:59:32Z)
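Of the four design axes this study enumerates, operation encoding is the easiest to make concrete. Below is a minimal sketch of one common scheme, a one-hot operation type concatenated with numeric shape features; the vocabulary and feature set are assumptions for illustration, not the paper's exact scheme.

```python
# Illustrative operation encoding for a latency predictor: one-hot operation
# type concatenated with numeric shape features (vocabulary and features are
# assumptions for illustration).
import numpy as np

OP_VOCAB = ["conv2d", "depthwise_conv2d", "dense", "pool", "relu"]

def encode_op(op_type, in_shape, out_channels, kernel_size):
    one_hot = np.zeros(len(OP_VOCAB))
    one_hot[OP_VOCAB.index(op_type)] = 1.0
    shape_feats = np.array([*in_shape, out_channels, kernel_size], dtype=float)
    return np.concatenate([one_hot, shape_feats])

vec = encode_op("conv2d", in_shape=(224, 224, 3), out_channels=32, kernel_size=3)
print(vec.shape)  # (10,) = 5 one-hot dims + 5 numeric features
```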
- Efficient Graph Neural Network Inference at Large Scale [54.89457550773165]
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications.
Existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure.
We propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information.
arXiv Detail & Related papers (2022-11-01T14:38:18Z)
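The adaptive propagation order idea above can be sketched as follows: features are propagated linearly, and each node stops after a personalized number of steps chosen from its topology. The degree-based stopping heuristic below is an assumed stand-in for the paper's actual rule.

```python
# Illustrative node-personalized propagation depth for linear-propagation GNN
# inference (the degree-based stopping rule is an assumption, not the paper's).
import numpy as np

def propagate_adaptive(adj, X, max_steps=3, degree_threshold=5):
    # adj: dense [n, n] adjacency matrix; X: [n, d] node features.
    deg = adj.sum(axis=1)
    P = adj / np.maximum(deg, 1)[:, None]  # row-normalized propagation matrix
    # Hypothetical heuristic: high-degree nodes aggregate a wide neighborhood
    # quickly, so they stop after one step; the rest use max_steps.
    steps = np.where(deg > degree_threshold, 1, max_steps)
    out = X.copy()
    H = X
    for k in range(1, max_steps + 1):
        H = P @ H                        # features after k propagation steps
        out[steps == k] = H[steps == k]  # nodes whose personalized depth is k
    return out

rng = np.random.default_rng(1)
n, d = 8, 4
adj = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(adj, 0)
print(propagate_adaptive(adj, rng.normal(size=(n, d))).shape)  # (8, 4)
```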
- Evaluating Short-Term Forecasting of Multiple Time Series in IoT Environments [67.24598072875744]
Internet of Things (IoT) environments are monitored via a large number of IoT-enabled sensing devices, which continuously generate large volumes of measurement data.
To alleviate this issue, sensors are often configured to operate at relatively low sampling frequencies.
This can dramatically hamper subsequent decision-making, such as forecasting.
arXiv Detail & Related papers (2022-06-15T19:46:59Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z)
- HELP: Hardware-Adaptive Efficient Latency Predictor for NAS via Meta-Learning [43.751220068642624]
The Hardware-adaptive Efficient Latency Predictor (HELP) formulates device-specific latency estimation as a meta-learning problem.
We introduce novel hardware embeddings that treat devices as black-box functions outputting latencies, and meta-learn the hardware-adaptive latency predictor in a device-dependent manner.
We validate HELP on unseen platforms, where it achieves high estimation performance with as few as 10 measurement samples, outperforming all relevant baselines.
arXiv Detail & Related papers (2021-06-16T08:36:21Z)
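A greatly simplified sketch in the spirit of HELP: the device is queried as a black box on a few reference architectures to form a hardware embedding, and a per-device predictor is fitted from roughly ten measured samples. The linear least-squares model is an illustrative assumption; HELP itself meta-learns the predictor.

```python
# Greatly simplified stand-in for hardware-adaptive latency prediction:
# a device is a black-box function from architecture features to latency,
# its "hardware embedding" is the vector of latencies measured on a few
# reference architectures, and a per-device predictor is fitted from ~10
# samples. HELP's actual meta-learning formulation is more sophisticated.
import numpy as np

def hardware_embedding(measure_fn, reference_archs):
    # Query the black-box device on a fixed set of reference architectures.
    return np.array([measure_fn(a) for a in reference_archs])

def fit_device_predictor(arch_feats, latencies):
    # Least-squares fit on a handful of (architecture, latency) samples.
    X = np.hstack([arch_feats, np.ones((len(arch_feats), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, latencies, rcond=None)
    return lambda a: float(np.append(a, 1.0) @ w)

# Toy black-box device: latency grows with two FLOPs-like features plus noise.
rng = np.random.default_rng(2)
device = lambda a: 0.5 * a[0] + 0.2 * a[1] + rng.normal(0.0, 0.05)

refs = [np.array(v) for v in ([1.0, 2.0], [4.0, 1.0], [8.0, 3.0])]
print("hardware embedding:", hardware_embedding(device, refs))

samples = rng.uniform(1, 10, size=(10, 2))   # 10 profiled architectures
measured = np.array([device(a) for a in samples])
predict = fit_device_predictor(samples, measured)
print("predicted latency:", predict(np.array([5.0, 2.0])))
```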
- Generalized Latency Performance Estimation for Once-For-All Neural Architecture Search [0.0]
We introduce two generalizability strategies, which include fine-tuning using a base model trained on a specific hardware platform and NAS search space.
We provide a family of latency prediction models that achieve over 50% lower RMSE loss as compared to ProxylessNAS.
arXiv Detail & Related papers (2021-01-04T00:48:09Z)
- LETI: Latency Estimation Tool and Investigation of Neural Networks inference on Mobile GPU [0.0]
In this work, we consider latency approximation on mobile GPU as a data and hardware-specific problem.
We build open-source tools which provide a convenient way to conduct massive experiments on different target devices.
We experimentally demonstrate the applicability of such an approach on a subset of the popular NAS-Bench-101 dataset.
arXiv Detail & Related papers (2020-10-06T16:51:35Z)
- MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [94.80212602202518]
We propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS).
We employ a one-shot architecture search approach to reduce the search cost.
We achieve state-of-the-art results in terms of accuracy-speed trade-off.
arXiv Detail & Related papers (2020-09-29T11:56:01Z)
- LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks [73.78551758828294]
LC-NAS is able to find state-of-the-art architectures for point cloud classification with minimal computational cost.
We show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy.
arXiv Detail & Related papers (2020-08-24T10:30:21Z)
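The common downstream use of all these predictors is latency-constrained selection during search, as in LC-NAS above. A minimal sketch with synthetic candidate scores:

```python
# Minimal sketch of latency-constrained candidate selection, the step a fast
# latency predictor enables during search (all candidate scores are synthetic).
import numpy as np

def select_under_budget(candidates, latency_budget_ms):
    # candidates: list of (name, predicted latency in ms, validation accuracy).
    feasible = [c for c in candidates if c[1] <= latency_budget_ms]
    if not feasible:
        raise ValueError("no candidate satisfies the latency budget")
    return max(feasible, key=lambda c: c[2])  # best accuracy among feasible

rng = np.random.default_rng(3)
pool = [(f"arch_{i}", rng.uniform(5, 50), rng.uniform(0.70, 0.80))
        for i in range(100)]
name, lat, acc = select_under_budget(pool, latency_budget_ms=20.0)
print(f"{name}: {lat:.1f} ms, accuracy {acc:.3f}")
```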