HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time
and Robustness
- URL: http://arxiv.org/abs/2205.15568v1
- Date: Tue, 31 May 2022 07:16:14 GMT
- Title: HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time
and Robustness
- Authors: Dennis Rieber and Moritz Reiber and Oliver Bringmann and Holger Fröning
- Abstract summary: This work evaluates how invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware.
A validity-driven initialization method for AutoTVM is developed that finds the best solution with only 41.6% of the hardware measurements otherwise required.
- Score: 1.165213554548421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The process of optimizing the latency of DNN operators with ML models and
hardware-in-the-loop, called auto-tuning, has established itself as a pervasive
method for the deployment of neural networks. From a search space of
loop optimizations, the candidate providing the best performance has to be
selected. The performance of individual configurations is evaluated through
hardware measurements. The combinatorial explosion of possible configurations,
together with the cost of hardware evaluation, makes exhaustive exploration of
the search space infeasible in practice. Machine learning methods, such as random
forests or reinforcement learning, are used to aid the selection of
candidates for hardware evaluation. For general-purpose hardware such as x86 and
GPGPU architectures, impressive performance gains can be achieved compared to
hand-optimized libraries like cuDNN. The method is also useful in the space of
hardware accelerators with less widespread adoption, where a high-performance
library is not always available. However, hardware accelerators are often less
flexible with respect to their programming, which leads to operator
configurations that are not executable on the hardware target. This work evaluates how
these invalid configurations affect the auto-tuning process and its underlying
performance prediction model for the VTA hardware. From these results, a
validity-driven initialization method for AutoTVM is developed that requires
only 41.6% of the hardware measurements otherwise needed to find the best solution,
while improving search robustness.
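To make the described flow concrete, below is a minimal, hypothetical sketch of an ML-guided auto-tuning loop with a validity-driven initialization step. It is not the authors' AutoTVM implementation: the toy search space, the validity rule (is_valid_on_target), the measurement stub (measure_latency), and the feature encoding are all illustrative assumptions, and a random forest stands in for the learned cost model.

```python
import random
from sklearn.ensemble import RandomForestRegressor

def is_valid_on_target(config):
    # Hypothetical hardware-aware check, e.g. tiling factors that fit the
    # accelerator's fixed compute/memory geometry (stand-in for VTA rules).
    return config["tile"] <= 16 and config["unroll"] in (1, 2, 4)

def measure_latency(config):
    # Stand-in for a hardware-in-the-loop measurement (returns fake latency).
    return random.uniform(1.0, 10.0)

def featurize(config):
    return [config["tile"], config["unroll"]]

# Toy search space of loop optimizations.
space = [{"tile": t, "unroll": u} for t in (4, 8, 16, 32) for u in (1, 2, 4, 8)]

# Validity-driven initialization: seed the cost model only with
# configurations that can actually execute on the target.
measured = [c for c in space if is_valid_on_target(c)][:8]
X = [featurize(c) for c in measured]
y = [measure_latency(c) for c in measured]
model = RandomForestRegressor(random_state=0).fit(X, y)

best_lat, best_cfg = min(zip(y, measured), key=lambda p: p[0])
for _ in range(5):  # tuning rounds
    # Rank the remaining valid candidates by predicted latency.
    candidates = [c for c in space if is_valid_on_target(c) and c not in measured]
    candidates.sort(key=lambda c: model.predict([featurize(c)])[0])
    for cfg in candidates[:4]:  # measure a small batch on hardware
        lat = measure_latency(cfg)
        measured.append(cfg)
        X.append(featurize(cfg))
        y.append(lat)
        if lat < best_lat:
            best_lat, best_cfg = lat, cfg
    model.fit(X, y)

print("best config:", best_cfg, "latency:", best_lat)
```

The point of the sketch is that the cost model is seeded and queried only with configurations that can actually execute on the target, so hardware measurements are not wasted on invalid candidates.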
Related papers
- Towards making the most of NLP-based device mapping optimization for
OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, which tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels.
We propose four different models that provide enhanced contextual information about the source code.
Experimental results show that our proposed methodology surpasses the work of Cummins et al., providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z) - MAPLE-X: Latency Prediction with Explicit Microprocessor Prior Knowledge [87.41163540910854]
Deep neural network (DNN) latency characterization is a time-consuming process.
We propose MAPLE-X which extends MAPLE by incorporating explicit prior knowledge of hardware devices and DNN architecture latency.
arXiv Detail & Related papers (2022-05-25T11:08:20Z) - Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to
Infer Hardware Performances [58.720142291102135]
'VPUNN' is a neural network-based cost model trained on low-level task profiling.
It consistently outperforms state-of-the-art cost models for Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z) - DANCE: Differentiable Accelerator/Network Co-Exploration [8.540518473228078]
This work presents a differentiable approach towards the co-exploration of the hardware accelerator and network architecture design.
By modeling the hardware evaluation software with a neural network, the relation between the accelerator architecture and the hardware metrics becomes differentiable.
Compared to naive existing approaches, our method performs co-exploration in a significantly shorter time while achieving superior accuracy and hardware cost metrics.
arXiv Detail & Related papers (2020-09-14T07:43:27Z) - A Learned Performance Model for Tensor Processing Units [5.733911161090224]
We demonstrate a method of learning performance models from a corpus of graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive; a minimal sketch of such a learned cost model is given after this list.
arXiv Detail & Related papers (2020-08-03T17:24:52Z) - Automated Design Space Exploration for optimised Deployment of DNN on
Arm Cortex-A CPUs [13.628734116014819]
Deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNNs).
There is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution.
We present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory.
arXiv Detail & Related papers (2020-06-09T11:00:06Z) - PolyDL: Polyhedral Optimizations for Creation of High Performance DL
primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid approach, combining compiler-generated code with minimal library use, results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
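Several of the related papers above (VPUNN, MAPLE, and the TPU performance model) share the idea of learning a latency predictor from profiled samples and using it in place of hardware measurements. The following is a minimal, hypothetical sketch of that idea: the feature set, the toy conv2d workloads, and the "measured" latencies are made-up illustrations, and a small MLP regressor stands in for the papers' actual models.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def op_features(op):
    # Hypothetical features of a conv2d workload: spatial size, channels,
    # kernel size, plus derived MAC count and arithmetic intensity.
    h, w, cin, cout, k = op
    macs = h * w * cin * cout * k * k
    bytes_moved = h * w * (cin + cout) + k * k * cin * cout
    return [h, w, cin, cout, k, macs / 1e6, macs / bytes_moved]

# Profiled training set: (workload, measured latency in ms); values are made up.
profiled = [
    ((56, 56, 64, 64, 3), 1.9),
    ((28, 28, 128, 128, 3), 1.7),
    ((14, 14, 256, 256, 3), 1.6),
    ((7, 7, 512, 512, 3), 1.5),
    ((56, 56, 64, 128, 1), 0.6),
]
X = np.array([op_features(op) for op, _ in profiled])
y = np.array([lat for _, lat in profiled])

# Small neural-network cost model trained on the profiled samples.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
model.fit(X, y)

# The trained model can now rank new workloads without touching hardware.
candidates = [(28, 28, 64, 256, 3), (14, 14, 512, 128, 3)]
pred = model.predict(np.array([op_features(c) for c in candidates]))
for c, p in zip(candidates, pred):
    print(c, "predicted latency (ms):", round(float(p), 2))
```

In practice such a model would be trained on many thousands of profiled workloads and embedded in a compiler or autotuner so that candidate programs can be ranked without hardware in the loop.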