TASO: Time and Space Optimization for Memory-Constrained DNN Inference
- URL: http://arxiv.org/abs/2005.10709v1
- Date: Thu, 21 May 2020 15:08:06 GMT
- Title: TASO: Time and Space Optimization for Memory-Constrained DNN Inference
- Authors: Yuan Wen, Andrew Anderson, Valentin Radu, Michael F.P. O'Boyle, David
Gregg
- Abstract summary: Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices.
We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on an integer linear programming (ILP) formulation for selecting primitive operations to implement convolutional layers.
- Score: 5.023660118588569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) are used in many embedded applications,
from industrial robotics and automation systems to biometric identification on
mobile devices. State-of-the-art classification is typically achieved by large
networks, which are prohibitively expensive to run on mobile and embedded
devices with tightly constrained memory and energy budgets. We propose an
approach for ahead-of-time, domain-specific optimization of CNN models, based on
an integer linear programming (ILP) formulation for selecting primitive operations to
implement convolutional layers. We optimize the trade-off between execution
time and memory consumption by: 1) attempting to minimize execution time across
the whole network by selecting data layouts and primitive operations to
implement each layer; and 2) allocating an appropriate workspace that reflects
the upper bound of memory footprint per layer. These two optimization
strategies can be used to run any CNN on any platform with a C compiler. Our
evaluation with a range of popular ImageNet neural architectures (GoogleNet,
AlexNet, VGG, ResNet and SqueezeNet) on the ARM Cortex-A15 yields speedups of
8x compared to greedy-algorithm-based primitive selection, and reduces memory
requirements by 2.2x while sacrificing only 15% of inference time compared to a
solver that considers inference time only. In addition, our optimization
approach exposes a range of optimal points for different configurations across
the Pareto frontier of the memory-latency trade-off, which can be used under
arbitrary system constraints.
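The two optimization strategies above can be illustrated with a minimal sketch. This is not the paper's ILP: it brute-forces the same selection problem over made-up per-layer costs (the primitive names, times, and workspace sizes are all hypothetical), picking one primitive per layer to minimise total time subject to a shared workspace sized to the largest per-layer footprint.

```python
from itertools import product

# Hypothetical per-layer candidates: (primitive, execution_time_ms, workspace_bytes).
# In the paper these costs come from profiling each primitive on the target platform.
layers = [
    [("im2col", 10, 800), ("winograd", 6, 1500), ("direct", 14, 100)],
    [("im2col", 20, 1600), ("winograd", 12, 3000), ("direct", 28, 100)],
    [("im2col", 8, 600), ("winograd", 5, 1200), ("direct", 11, 100)],
]

def best_plan(memory_budget):
    """Pick one primitive per layer minimising total time, subject to a
    shared workspace sized to the largest per-layer footprint."""
    best = None
    for choice in product(*layers):
        workspace = max(c[2] for c in choice)   # upper bound across layers
        if workspace > memory_budget:
            continue
        total_time = sum(c[1] for c in choice)
        if best is None or total_time < best[0]:
            best = (total_time, workspace, [c[0] for c in choice])
    return best

# Sweep budgets to trace the time/memory Pareto frontier.
for budget in (100, 1500, 3000):
    print(budget, best_plan(budget))
```

Sweeping the budget exposes the Pareto frontier the abstract describes: a tight memory budget forces the low-footprint (but slow) primitives everywhere, while larger budgets unlock faster primitives one layer at a time.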
Related papers
- Enhancing MOTION2NX for Efficient, Scalable and Secure Image Inference using Convolutional Neural Networks [4.407841002228536]
We use the ABY2.0 SMPC protocol, implemented on the C++-based MOTION2NX framework, for a secure convolutional neural network (CNN) inference application with semi-honest security.
We also present a novel splitting algorithm that divides the computations at each CNN layer into multiple chunks.
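The chunk-splitting idea can be sketched with a toy 1-D convolution (illustrative only; the paper splits 2-D CNN layers inside an SMPC protocol): computing the output a chunk at a time needs only a small overlapping input window per chunk, and the concatenated chunks match the full result.

```python
def conv1d(x, k):
    """Valid 1-D convolution (correlation) of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]

def conv1d_chunked(x, k, chunk):
    """Same result, computed `chunk` outputs at a time, so only a small
    input window is live for each piece of the computation."""
    n = len(x) - len(k) + 1
    out = []
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        window = x[start:stop + len(k) - 1]   # inputs needed for this chunk
        out += conv1d(window, k)
    return out

x = [1, 2, 3, 4, 5, 6]
k = [1, 0, -1]
assert conv1d_chunked(x, k, chunk=2) == conv1d(x, k)
```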
arXiv Detail & Related papers (2024-08-29T09:50:21Z)
- Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization [4.614780125575351]
We propose an efficient memory-aware scheduling framework based on iterative graph optimization.
Our framework features an iterative graph fusion algorithm that simplifies the graph while preserving the scheduling optimality.
arXiv Detail & Related papers (2023-08-26T14:52:02Z)
- An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs flexibly and efficiently.
arXiv Detail & Related papers (2022-08-30T08:05:43Z)
- Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances [58.720142291102135]
'VPUNN' is a neural network-based cost model trained on low-level task profiling.
It consistently outperforms the state-of-the-art cost modeling in Intel's line of VPU processors.
arXiv Detail & Related papers (2022-05-09T22:48:39Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
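The decomposition can be sketched for a toy case (an illustration of the general idea, not the paper's exact scheme): a weight drawn from a symmetric odd-integer quantization grid splits into scaled {-1, +1} branches, so a dot product becomes a weighted sum of binary (XNOR-friendly) dot products. All values below are illustrative.

```python
def decompose(w, bits):
    """Decompose a weight from the symmetric odd-integer grid
    {-(2**bits - 1), ..., -1, +1, ..., 2**bits - 1} into `bits` branches
    b_i in {-1, +1} such that w = sum(2**i * b_i)."""
    branches = []
    for i in reversed(range(bits)):
        b = 1 if w > 0 else -1            # greedily match the sign of the remainder
        branches.append((2 ** i, b))
        w -= (2 ** i) * b
    return branches

def binary_dot(weights, x, bits=2):
    """Compute dot(weights, x) as a weighted sum of {-1, +1} dot products,
    one per branch (scale)."""
    total = 0
    for i in range(bits):
        scale = 2 ** i
        # Branch i: the {-1, +1} component of every weight at this scale.
        branch = [dict(decompose(w, bits))[scale] for w in weights]
        total += scale * sum(b * xi for b, xi in zip(branch, x))
    return total

w = [3, -1, 1, -3]
x = [2, 5, -4, 1]
assert binary_dot(w, x) == sum(wi * xi for wi, xi in zip(w, x))
```

Each branch is a pure {-1, +1} vector, which is what makes bitwise acceleration possible on hardware that supports XNOR/popcount dot products.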
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
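A minimal sketch of such an exit policy (the heads and threshold here are hypothetical; in the paper both the heads and the policy are co-optimised): return the first exit whose confidence clears a threshold, otherwise fall through to the final head.

```python
def multi_exit_infer(x, exits, threshold=0.9):
    """Run a network with early exits: return the first exit head whose
    confidence (max probability) clears the threshold, else the final
    head's prediction. Each element of `exits` maps input -> prob list."""
    for stage in exits[:-1]:
        probs = stage(x)
        if max(probs) >= threshold:       # confident enough: exit early
            return probs
    return exits[-1](x)                   # hardest samples run the full depth

# Toy exit heads returning a probability distribution over two classes.
early = lambda x: [0.95, 0.05] if x == "easy" else [0.6, 0.4]
final = lambda x: [0.1, 0.9]
assert multi_exit_infer("easy", [early, final]) == [0.95, 0.05]  # exits early
assert multi_exit_infer("hard", [early, final]) == [0.1, 0.9]    # runs to the end
```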
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z)
- Optimising the Performance of Convolutional Neural Networks across Computing Systems using Transfer Learning [0.08594140167290096]
We propose to replace a lengthy profiling stage with a machine-learning-based approach to performance modeling.
After training, our performance model can estimate the performance of convolutional primitives in any layer configuration.
The time to optimise the execution of large neural networks via primitive selection is reduced from hours to just seconds.
arXiv Detail & Related papers (2020-10-20T20:58:27Z)
- Pairwise Neural Networks (PairNets) with Low Memory for Fast On-Device Applications [0.0]
A traditional artificial neural network (ANN) is normally trained slowly by a gradient descent algorithm, such as the backpropagation algorithm.
We created a novel wide and shallow 4-layer ANN called "Pairwise Neural Network" ("PairNet") with high-speed, non-gradient-descent hyperparameter optimization.
arXiv Detail & Related papers (2020-02-10T02:12:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.