MAPLE: Microprocessor A Priori for Latency Estimation
- URL: http://arxiv.org/abs/2111.15106v1
- Date: Tue, 30 Nov 2021 03:52:15 GMT
- Title: MAPLE: Microprocessor A Priori for Latency Estimation
- Authors: Saad Abbasi, Alexander Wong, and Mohammad Javad Shafiee
- Abstract summary: Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
- Score: 81.91509153539566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern deep neural networks must demonstrate state-of-the-art accuracy while
exhibiting low latency and energy consumption. As such, neural architecture
search (NAS) algorithms take these two constraints into account when generating
a new architecture. However, efficiency metrics such as latency are typically
hardware dependent requiring the NAS algorithm to either measure or predict the
architecture latency. Measuring the latency of every evaluated architecture
adds a significant amount of time to the NAS process. Here we propose
Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on
transfer learning or domain adaptation but instead generalizes to new hardware
by incorporating prior hardware characteristics during training. MAPLE takes
advantage of a novel quantitative strategy to characterize the underlying
microprocessor by measuring relevant hardware performance metrics, yielding a
fine-grained and expressive hardware descriptor. Moreover, MAPLE exploits the
tightly coupled I/O between the CPU and GPU and their dependency to predict DNN
latency on GPUs while measuring microprocessor performance counters on the CPU
that feeds the GPU hardware. Through this quantitative strategy as the hardware
descriptor, MAPLE can generalize to new hardware via a few-shot adaptation
strategy where, with as few as 3 samples, it exhibits a 3% improvement over
state-of-the-art methods that require as many as 10 samples. Experimental
results show that increasing the number of few-shot adaptation samples to 10
improves accuracy over state-of-the-art methods by 12%. Furthermore, MAPLE
exhibits 8-10% better accuracy, on average, than relevant baselines at any
number of adaptation samples.
Related papers
- MONAS: Efficient Zero-Shot Neural Architecture Search for MCUs [5.321424657585365]
MONAS is a novel zero-shot NAS framework specifically designed for microcontrollers (MCUs) in edge computing.
MONAS achieves up to a 1104x improvement in search efficiency over previous work targeting MCUs.
MONAS can discover CNN models with over 3.23x faster inference on MCUs while maintaining similar accuracy compared to more general NAS approaches.
arXiv Detail & Related papers (2024-08-26T10:24:45Z)
- On Latency Predictors for Neural Architecture Search [8.564763702766776]
We introduce a comprehensive suite of latency prediction tasks obtained in a principled way through automated partitioning of hardware device sets.
We then design a general latency predictor to comprehensively study (1) the predictor architecture, (2) NN sample selection methods, (3) hardware device representations, and (4) NN operation encoding schemes.
Building on conclusions from our study, we present an end-to-end latency predictor training strategy.
arXiv Detail & Related papers (2024-03-04T19:59:32Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Inference Latency Prediction at the Edge [0.3974789827371669]
State-of-the-art neural architectures (NAs) are typically designed through Neural Architecture Search (NAS) to identify NAs with good tradeoffs between accuracy and efficiency.
Since measuring the latency of a huge set of candidate architectures during NAS is not scalable, approaches are needed for predicting end-to-end inference latency on mobile devices.
We propose a latency prediction framework which addresses these challenges by developing operation-wise latency predictors.
arXiv Detail & Related papers (2022-10-06T00:46:06Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z)
- MAPLE-Edge: A Runtime Latency Predictor for Edge Devices [80.01591186546793]
We propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware.
Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters.
We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes.
arXiv Detail & Related papers (2022-04-27T14:00:48Z)
- Latency-Aware Differentiable Neural Architecture Search [113.35689580508343]
Differentiable neural architecture search methods have become popular in recent years, mainly due to their low search cost and flexibility in designing the search space.
However, these methods have difficulty optimizing the network for hardware, so the searched architecture is often hardware-unfriendly.
This paper addresses the problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off between accuracy and latency via a balancing coefficient (a minimal sketch of such a term follows this list).
arXiv Detail & Related papers (2020-01-17T15:55:21Z)
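The last entry above adds a differentiable latency loss term to the search objective. A minimal sketch of one common way such a term is expressed (softmax-weighted per-operation latencies for a single searchable layer; the candidate op set, lookup values, and coefficient are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-candidate-op latency lookup for one searchable layer (ms),
# e.g. conv3x3, conv5x5, sep_conv, skip.
op_latency = torch.tensor([0.8, 1.6, 2.9, 0.1])

# Architecture logits, learned jointly with the network weights.
alpha = torch.zeros(4, requires_grad=True)

def expected_latency(alpha, op_latency):
    # Softmax mixing makes the latency term differentiable w.r.t. alpha.
    return (F.softmax(alpha, dim=0) * op_latency).sum()

def total_loss(task_loss, alpha, op_latency, lam=0.1):
    # lam is the balancing coefficient trading accuracy against latency.
    return task_loss + lam * expected_latency(alpha, op_latency)
```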
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.