Towards High Performance, Portability, and Productivity: Lightweight
Augmented Neural Networks for Performance Prediction
- URL: http://arxiv.org/abs/2003.07497v2
- Date: Sun, 30 Aug 2020 08:30:24 GMT
- Title: Towards High Performance, Portability, and Productivity: Lightweight
Augmented Neural Networks for Performance Prediction
- Authors: Ajitesh Srivastava (1), Naifeng Zhang (1), Rajgopal Kannan (2), Viktor
K. Prasanna (1) ((1) University of Southern California, (2) US Army Research
Lab-West)
- Abstract summary: We propose lightweight augmented neural networks for arbitrary combinations of kernel-variant-hardware.
We are able to obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks.
Our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Writing high-performance code requires significant expertise in the
programming language, compiler optimizations, and hardware knowledge. This
often leads to poor productivity and portability and is inconvenient for a
non-programmer domain-specialist such as a Physicist. More desirable is a
high-level language where the domain-specialist simply specifies the workload
in terms of high-level operations (e.g., matrix-multiply(A, B)), and the
compiler identifies the best implementation fully utilizing the heterogeneous
platform. For creating a compiler that supports productivity, portability, and
performance simultaneously, it is crucial to predict the performance of various
available implementations (variants) of the dominant operations (kernels)
contained in the workload on various hardware to decide (a) which variant
should be chosen for each kernel in the workload, and (b) on which hardware
resource the variant should run. To enable the performance prediction, we
propose lightweight augmented neural networks for arbitrary combinations of
kernel-variant-hardware. A key innovation is utilizing the mathematical
complexity of the kernels as a feature to achieve higher accuracy. These models
are compact to reduce training time and fast inference during compile-time and
run-time. Using models with less than 75 parameters, and only 250 training data
instances, we are able to obtain a low MAPE of 3%, significantly outperforming
traditional feed-forward neural networks on 48 kernel-variant-hardware
combinations. We further demonstrate that our variant-selection approach can be
used in Halide implementations to obtain up to 1.7x speedup over Halide's
auto-scheduler.
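The core idea above — tiny per-combination models whose input is augmented with the kernel's mathematical complexity, then variant selection by picking the lowest predicted runtime — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the variant names, cost coefficients, and the single-parameter linear model are hypothetical stand-ins for the paper's small neural networks, and the timings are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-variant cost coefficients (seconds per FLOP); not from the paper.
VARIANTS = {"naive": 2.0e-10, "blocked": 0.8e-10}

def complexity(n):
    # Mathematical complexity of square matrix-multiply: ~2*n^3 FLOPs.
    # This is the "augmentation" feature derived from the kernel's math.
    return 2.0 * n**3

# Synthetic training data: 250 instances, matching the paper's training budget.
sizes = rng.integers(64, 1024, size=250)
X = complexity(sizes.astype(float))

models = {}
for name, coef in VARIANTS.items():
    # Simulated noisy timings for this kernel-variant-hardware combination.
    t = coef * X * (1.0 + 0.02 * rng.standard_normal(X.size))
    # Lightweight model: a single learned parameter (runtime ~ w * FLOPs),
    # standing in for the paper's <75-parameter networks.
    w, *_ = np.linalg.lstsq(X[:, None], t, rcond=None)
    models[name] = w[0]

def predict(name, n):
    return models[name] * complexity(float(n))

def mape(y_true, y_pred):
    # Mean absolute percentage error, the accuracy metric quoted in the abstract.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def select_variant(n):
    # Variant selection: choose the variant with the lowest predicted runtime.
    return min(VARIANTS, key=lambda v: predict(v, n))
```

Because the complexity feature already captures most of the runtime's shape, even this one-parameter model fits the synthetic timings to within a few percent MAPE; the paper's point is that small augmented networks achieve similarly low error on real kernel-variant-hardware measurements.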
Related papers
- Jacobian-Enhanced Neural Networks [0.0]
Jacobian-Enhanced Neural Networks (JENN) are densely connected multi-layer perceptrons.
JENN's main benefit is better accuracy with fewer training points compared to standard neural networks.
arXiv Detail & Related papers (2024-06-13T14:04:34Z)
- Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation [8.64220475114214]
oneDNN Graph Compiler employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation.
Experimental results demonstrate significant performance gains over an existing tensor compiler and primitives library for performance-critical computation graphs.
arXiv Detail & Related papers (2023-01-03T19:52:17Z)
- Towards making the most of NLP-based device mapping optimization for OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, which tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels.
We propose four different models that provide enhanced contextual information of source codes.
Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z)
- Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances [58.720142291102135]
'VPUNN' is a neural network-based cost model trained on low-level task profiling.
It consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete variables (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- Efficient Algorithms for Device Placement of DNN Graph Operators [12.871398348743591]
Modern machine learning workloads use large models, with complex structures, that are very expensive to execute.
The devices that execute complex models are becoming increasingly heterogeneous, as domain-specific accelerators flourish alongside CPUs.
Recent work has shown that significant gains can be obtained with model parallelism, i.e., partitioning a neural network's computational graph onto multiple devices.
In this paper, we identify and isolate the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings.
arXiv Detail & Related papers (2020-06-29T22:45:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.