Performance Aware Convolutional Neural Network Channel Pruning for
Embedded GPUs
- URL: http://arxiv.org/abs/2002.08697v1
- Date: Thu, 20 Feb 2020 12:07:44 GMT
- Title: Performance Aware Convolutional Neural Network Channel Pruning for
Embedded GPUs
- Authors: Valentin Radu, Kuba Kaszyk, Yuan Wen, Jack Turner, Jose Cano, Elliot
J. Crowley, Bjorn Franke, Amos Storkey, Michael O'Boyle
- Abstract summary: We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance.
We also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM.
- Score: 6.035819238203187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNN) are becoming a common presence in many
applications and services, due to their superior recognition accuracy. They are
increasingly being used on mobile devices, often simply by porting large
models designed for the server space, although several model compression techniques
have been considered. One model compression technique intended to reduce
computations is channel pruning. Mobile and embedded systems now have GPUs
which are ideal for the parallel computations of neural networks and for their
lower energy cost per operation. Specialized libraries perform these neural
network computations through highly optimized routines. As we find in our
experiments, these libraries are optimized for the most common network shapes,
making uninstructed channel pruning inefficient. We evaluate higher level
libraries, which analyze the input characteristics of a convolutional layer,
based on which they produce optimized OpenCL (Arm Compute Library and TVM) and
CUDA (cuDNN) code. However, in reality, these characteristics and subsequent
choices intended for optimization can have the opposite effect. We show that a
reduction in the number of convolutional channels, pruning 12% of the initial
size, is in some cases detrimental to performance, leading to 2x slowdown. On
the other hand, we also find examples where performance-aware pruning achieves
the intended results, with performance speedups of 3x with cuDNN and above 10x
with Arm Compute Library and TVM. Our findings expose the need for
hardware-instructed neural network pruning.
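The slowdown reported above comes from pruning a layer to a channel count that the library's optimized routines do not expect, so the practical question is how a given channel count actually runs on the target library. Below is a minimal, hypothetical sketch of that kind of measurement in PyTorch (which dispatches to cuDNN on NVIDIA GPUs); the single Conv2d layer, the L1-norm channel-selection criterion, the input shape, and the pruning ratios are illustrative assumptions, not the authors' pipeline or their Arm Compute Library / TVM setup.

```python
# Sketch (not the paper's code): prune output channels of a conv layer by L1
# norm, then time the pruned layer to see how an "uncommon" channel count
# affects library performance. All shapes and ratios below are illustrative.
import time
import torch
import torch.nn as nn

def prune_output_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the output channels with the largest L1 weight norm."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per output channel
    keep = torch.argsort(l1, descending=True)[:n_keep]
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

def time_layer(layer: nn.Module, x: torch.Tensor, iters: int = 50) -> float:
    """Average forward latency in milliseconds (cuDNN is used on NVIDIA GPUs)."""
    with torch.no_grad():
        for _ in range(5):                      # warm-up
            layer(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 128, 56, 56, device=device)        # illustrative input
conv = nn.Conv2d(128, 256, 3, padding=1).to(device).eval()
for ratio in (1.0, 0.88, 0.75, 0.5):                  # 0.88 ~= pruning 12%
    layer = conv if ratio == 1.0 else prune_output_channels(conv, ratio).to(device).eval()
    print(f"keep {ratio:4.2f} -> {layer.out_channels:3d} channels: "
          f"{time_layer(layer, x):6.3f} ms")
```

On an embedded GPU the same sweep can be repeated against TVM or Arm Compute Library builds of the layer; the paper's point is that the smallest layer is not always the fastest one.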
Related papers
- Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators [0.0]
Deep Neural Networks (DNNs) are being developed, trained, and utilized, putting a strain on both advanced and limited devices.
Our solution is to implement weight block sparsity, which is a structured sparsity that is friendly to hardware.
We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16.
arXiv Detail & Related papers (2024-07-12T17:37:49Z) - Resource Constrained Model Compression via Minimax Optimization for
Spiking Neural Networks [11.19282454437627]
Spiking Neural Networks (SNNs) are event-driven and highly energy-efficient networks.
It is difficult to deploy these networks on resource-limited edge devices directly.
We propose an improved end-to-end Minimax optimization method for this sparse learning problem.
arXiv Detail & Related papers (2023-08-09T02:50:15Z) - Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z) - Toward Accurate Platform-Aware Performance Modeling for Deep Neural
Networks [0.17499351967216337]
We provide a machine learning-based method, PerfNetV2, which improves the accuracy of our previous work for modeling the neural network performance on a variety of GPU accelerators.
Given an application, the proposed method can be used to predict the inference time and training time of the convolutional neural networks used in the application.
Our case studies show that PerfNetV2 yields a mean absolute percentage error within 13.1% on LeNet, AlexNet, and VGG16 on NVIDIA GTX-1080Ti, while the error rate on a previous work published in ICBD 2018 could be as large as 200%.
arXiv Detail & Related papers (2020-12-01T01:42:23Z) - When deep learning models on GPU can be accelerated by taking advantage
of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of the convolution layers.
arXiv Detail & Related papers (2020-11-12T10:13:48Z) - Optimization of XNOR Convolution for Binary Convolutional Neural
Networks on GPU [2.578242050187029]
We propose an implementation of binary convolutional network inference on GPU.
Experimental results show that using the GPU can provide a speed-up of up to 42.61x with a kernel size of 3x3 (a minimal sketch of the underlying XNOR-popcount idea follows this list).
arXiv Detail & Related papers (2020-07-28T13:01:17Z)
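For the XNOR convolution entry above, the core trick is replacing floating-point multiply-accumulates with bitwise XNOR and popcount over bit-packed {-1, +1} operands. The following is a minimal sketch of that dot-product identity in plain Python/NumPy; it illustrates the general technique only, not the paper's GPU kernel, and the bit-packing layout is an assumption made for the example.

```python
# Illustrative XNOR-popcount dot product for {-1, +1} vectors (plain Python,
# not the paper's GPU kernel). Values are packed into the bits of one integer:
# bit = 1 encodes +1, bit = 0 encodes -1.
import numpy as np

def pack_bits(v: np.ndarray) -> int:
    """Pack a {-1, +1} vector into an integer bitmask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits: int, b_bits: int, n: int) -> int:
    """dot(a, b) = 2 * (number of matching bits) - n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR + popcount
    return 2 * matches - n

rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
assert xnor_dot(pack_bits(a), pack_bits(b), n) == int(a @ b)
print("xnor dot:", xnor_dot(pack_bits(a), pack_bits(b), n))
```

A real kernel packs 32 or 64 values per machine word and uses hardware popcount instructions, which is the kind of bit-level parallelism such GPU implementations exploit.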
This list is automatically generated from the titles and abstracts of the papers on this site.