GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile
Devices based on Fine-Grained Structured Weight Sparsity
- URL: http://arxiv.org/abs/2108.11033v1
- Date: Wed, 25 Aug 2021 03:50:46 GMT
- Title: GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile
Devices based on Fine-Grained Structured Weight Sparsity
- Authors: Wei Niu, Zhengang Li, Xiaolong Ma, Peiyan Dong, Gang Zhou, Xuehai
Qian, Xue Lin, Yanzhi Wang, Bin Ren
- Abstract summary: This paper designs a novel mobile inference acceleration framework GRIM that is General to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We propose a new fine-grained structured sparsity scheme through the Block-based Column-Row (BCR) pruning.
Based on this new fine-grained structured sparsity, our GRIM framework consists of two parts: (a) the compiler optimization and code generation for real-time mobile inference; and (b) the BCR pruning optimizations for determining pruning hyperparameters and performing weight pruning.
- Score: 46.75304109970339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is appealing but challenging to achieve real-time deep neural network
(DNN) inference on mobile devices because even the powerful modern mobile
devices are considered "resource-constrained" when executing large-scale
DNNs. This necessitates sparse model inference via weight pruning, i.e., DNN
weight sparsity, and it is desirable to design a new DNN weight sparsity scheme
that can facilitate real-time inference on mobile devices while preserving a
high sparse model accuracy. This paper designs a novel mobile inference
acceleration framework GRIM that is General to both convolutional neural
networks (CNNs) and recurrent neural networks (RNNs) and that achieves
Real-time execution and high accuracy, leveraging fine-grained structured
sparse model Inference and compiler optimizations for Mobiles. We start by
proposing a new fine-grained structured sparsity scheme through the Block-based
Column-Row (BCR) pruning. Based on this new fine-grained structured sparsity,
our GRIM framework consists of two parts: (a) the compiler optimization and
code generation for real-time mobile inference; and (b) the BCR pruning
optimizations for determining pruning hyperparameters and performing weight
pruning. We compare GRIM with Alibaba MNN, TVM, TensorFlow-Lite, a sparse
implementation based on CSR, PatDNN, and ESE (a representative FPGA inference
acceleration framework for RNNs), and achieve up to 14.08x speedup.
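The abstract only outlines BCR pruning at a high level (each weight matrix is divided into blocks, and whole columns and rows are pruned inside each block), so the following is a minimal NumPy sketch of that idea. The function name bcr_prune, the block size, the L2-norm importance criterion, and the fixed keep ratios are illustrative assumptions, not the paper's actual procedure, which selects these hyperparameters through its BCR pruning optimization.

```python
# A minimal, illustrative sketch of block-based column-row (BCR) pruning on a
# 2-D weight matrix. Block shape, the L2-norm importance criterion, and the
# per-block keep ratios are assumptions for illustration only.
import numpy as np

def bcr_prune(weight, block_shape=(16, 16), col_keep=0.5, row_keep=0.5):
    """Zero out the least-important columns and rows inside each block."""
    pruned = weight.copy()
    br, bc = block_shape
    rows, cols = weight.shape
    for r0 in range(0, rows, br):
        for c0 in range(0, cols, bc):
            block = pruned[r0:r0 + br, c0:c0 + bc]  # view into `pruned`
            # Importance of each column / row within the block (L2 norm),
            # measured before any pruning inside this block.
            col_norms = np.linalg.norm(block, axis=0)
            row_norms = np.linalg.norm(block, axis=1)
            n_cols_keep = max(1, int(round(col_keep * block.shape[1])))
            n_rows_keep = max(1, int(round(row_keep * block.shape[0])))
            # Whole columns/rows are zeroed, preserving a regular structure.
            cols_to_prune = np.argsort(col_norms)[:block.shape[1] - n_cols_keep]
            rows_to_prune = np.argsort(row_norms)[:block.shape[0] - n_rows_keep]
            block[:, cols_to_prune] = 0.0
            block[rows_to_prune, :] = 0.0
    return pruned

if __name__ == "__main__":
    w = np.random.randn(64, 64).astype(np.float32)
    sparse_w = bcr_prune(w, block_shape=(16, 16), col_keep=0.5, row_keep=0.5)
    print("sparsity:", 1.0 - np.count_nonzero(sparse_w) / sparse_w.size)
```

Because whole columns and rows inside each block are zeroed, the surviving weights keep a regular layout that compiler-level code generation can exploit, which is the role of GRIM's first component.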
Related papers
- SparseByteNN: A Novel Mobile Inference Acceleration Framework Based on
Fine-Grained Group Sparsity [10.89385369643021]
We present a novel mobile inference acceleration framework SparseByteNN.
We show that for 30% sparse MobileNet-v1, SparseByteNN achieves 1.27x speedup over the dense version and 1.29x speedup over the state-of-the-art sparse inference engine MNN with a slight accuracy drop of 0.224%.
arXiv Detail & Related papers (2023-10-30T13:08:48Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Architecture Aware Latency Constrained Sparse Neural Networks [35.50683537052815]
In this paper, we design an architecture-aware, latency-constrained sparse framework to prune and accelerate CNN models.
We also propose a novel sparse convolution algorithm for efficient computation.
Our system-algorithm co-design framework can achieve a much better accuracy-latency frontier on resource-constrained mobile devices.
arXiv Detail & Related papers (2021-09-01T03:41:31Z) - A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration
Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to meet the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z) - BLK-REW: A Unified Block-based DNN Pruning Framework using Reweighted
Regularization Method [69.49386965992464]
We propose a new block-based pruning framework that comprises a general and flexible structured pruning dimension as well as a powerful and efficient reweighted regularization method.
Our framework is universal and can be applied to both CNNs and RNNs, implying complete support for the two major kinds of computation-intensive layers.
It is the first time that a weight pruning framework achieves universal coverage of both CNNs and RNNs with real-time mobile acceleration and no accuracy compromise.
arXiv Detail & Related papers (2020-01-23T03:30:56Z) - An Image Enhancing Pattern-based Sparsity for Real-time Inference on
Mobile Devices [58.62801151916888]
We introduce a new sparsity dimension, namely pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly.
Our approach on the new pattern-based sparsity naturally fits into compiler optimization for highly efficient DNN execution on mobile platforms.
arXiv Detail & Related papers (2020-01-20T16:17:36Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space (a rough illustrative sketch follows this list).
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
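Both pattern-based entries above (PatDNN and the image-enhancing pattern-based sparsity work) rely on the idea that every convolution kernel keeps only the positions allowed by one mask from a small predefined pattern library. The sketch below illustrates that idea under stated assumptions: the four 4-out-of-9 patterns and the magnitude-based pattern selection are made up for illustration and are not the papers' actual pattern sets.

```python
# A rough sketch of pattern-based kernel pruning: every 3x3 kernel keeps only
# the positions allowed by one pattern from a small predefined library. The
# library and the magnitude-based selection below are illustrative assumptions.
import numpy as np

# Each pattern is a 3x3 binary mask keeping 4 of the 9 weights.
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=np.float32)

def pattern_prune(conv_weight):
    """conv_weight: (out_ch, in_ch, 3, 3). For each kernel, pick the pattern
    that preserves the most weight magnitude, then zero everything else."""
    out_ch, in_ch, kh, kw = conv_weight.shape
    assert (kh, kw) == (3, 3), "sketch only handles 3x3 kernels"
    pruned = np.empty_like(conv_weight)
    for o in range(out_ch):
        for i in range(in_ch):
            kernel = conv_weight[o, i]
            # Magnitude retained by each candidate pattern.
            scores = (np.abs(kernel)[None] * PATTERNS).sum(axis=(1, 2))
            best = PATTERNS[np.argmax(scores)]
            pruned[o, i] = kernel * best
    return pruned

if __name__ == "__main__":
    w = np.random.randn(8, 4, 3, 3).astype(np.float32)
    pw = pattern_prune(w)
    print("kept fraction:", np.count_nonzero(pw) / pw.size)  # ~4/9
```

Because every kernel ends up with the same number of non-zeros arranged in one of a few known shapes, a compiler can specialize the convolution loops per pattern, which is how these works recover hardware efficiency from fine-grained sparsity.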