Directional Pruning of Deep Neural Networks
- URL: http://arxiv.org/abs/2006.09358v2
- Date: Tue, 13 Oct 2020 19:58:30 GMT
- Title: Directional Pruning of Deep Neural Networks
- Authors: Shih-Kang Chao, Zhanyu Wang, Yue Xing and Guang Cheng
- Abstract summary: Stochastic gradient descent (SGD) often finds a flat minimum valley of the training loss.
We propose a novel directional pruning method which searches for a sparse minimizer in or close to that flat region.
- Score: 26.41161344079131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In light of the fact that stochastic gradient descent (SGD) often
finds a flat minimum valley of the training loss, we propose a novel
directional pruning method which searches for a sparse minimizer in or close to
that flat region. The proposed pruning method requires neither retraining nor
expert knowledge of the sparsity level. To overcome the computational burden of
estimating the flat directions, we propose a carefully tuned $\ell_1$ proximal
gradient algorithm which provably achieves the directional pruning with a small
learning rate after sufficient training. Empirically, our solution delivers
promising results among many existing pruning methods in the highly sparse
regime (92% sparsity) on ResNet50 with ImageNet, while using only slightly more
wall time and memory than SGD. Using VGG16 and the wide ResNet 28x10 on
CIFAR-10 and CIFAR-100, we demonstrate that our solution reaches the same
minimum valley as SGD, and that the minima found by our solution and by SGD do
not deviate in directions that impact the training loss. The code that
reproduces the results of this paper is available at
https://github.com/donlan2710/gRDA-Optimizer/tree/master/directional_pruning.
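For intuition, here is a minimal sketch of the kind of $\ell_1$ proximal gradient (soft-thresholding) update the abstract refers to. It is a generic ISTA-style step, not the paper's exact gRDA schedule (the official implementation lives in the repository linked above); the fixed threshold `lam` and the toy quadratic loss are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1: shrinks each coordinate toward
    zero and sets coordinates with |v_i| <= lam exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def l1_proximal_step(w, grad, lr, lam):
    """One ISTA-style update: a gradient step followed by soft-thresholding.
    Note: the paper carefully tunes its threshold over training; here `lam`
    is a fixed, hypothetical value used only for illustration."""
    return soft_threshold(w - lr * grad, lr * lam)

# Toy usage on the quadratic loss 0.5 * ||w - target||^2.
target = np.array([1.0, -0.02, 0.5, 0.003])
w = np.zeros_like(target)
for _ in range(200):
    grad = w - target                          # gradient of the toy loss
    w = l1_proximal_step(w, grad, lr=0.1, lam=0.05)
print(w)  # small coordinates end up exactly 0; large ones are shrunk by lam
```

The sparsity here comes from the proximal operator itself rather than from a post-hoc threshold, which mirrors the abstract's claim that no retraining or preset sparsity level is needed.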
Related papers
- Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH).
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
- AUTOSPARSE: Towards Automated Sparse Training of Deep Neural Networks [2.6742343015805083]
We propose Gradient Annealing (GA) to explore the non-uniform distribution of sparsity inherent within neural networks.
GA provides an elegant trade-off between sparsity and accuracy without the need for additional sparsity-inducing regularization.
We integrate GA with the latest learnable pruning methods to create an automated sparse training algorithm called AutoSparse.
arXiv Detail & Related papers (2023-04-14T06:19:07Z)
- Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off [19.230329532065635]
Sparse training could significantly mitigate the training costs by reducing the model size.
Existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies.
In this work, we consider dynamic sparse training as a sparse connectivity search problem.
Experimental results show that sparse models (up to 98% sparsity) obtained by our proposed method outperform the SOTA sparse training methods.
arXiv Detail & Related papers (2022-11-30T01:22:25Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning.
Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor.
We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z)
- GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest.
GDP can be plugged in before convolutional layers, without bells and whistles, to control the on/off state of each channel.
Experiments conducted over CIFAR-10 and ImageNet datasets show that the proposed GDP achieves the state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
- Structured Directional Pruning via Perturbation Orthogonal Projection [13.704348351073147]
A more reasonable approach is to find a sparse minimizer along the flat minimum valley found by optimizers such as SGD.
We propose structured directional pruning, which projects the pruning perturbations onto the flat minimum valley (see the toy sketch after this list).
Experiments show that our method obtains state-of-the-art pruned accuracy (e.g. 93.97% on the VGG16, CIFAR-10 task) without retraining.
arXiv Detail & Related papers (2021-07-12T11:35:47Z)
- Dep-$L_0$: Improving $L_0$-based Network Sparsification via Dependency Modeling [6.081082481356211]
Training deep neural networks with an $L_0$ regularization is one of the prominent approaches for network pruning or sparsification.
We show that this method performs inconsistently on large-scale learning tasks, such as ResNet50 on ImageNet.
We propose to model the dependency among the binary gates, which can be done effectively with a multi-layer perceptron.
arXiv Detail & Related papers (2021-06-30T19:33:35Z)
- RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm [50.76576946099215]
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum of the loss surface within a small region.
Surprisingly, even with the additional per-iteration cost, the overall training cost of the method is empirically observed to be lower than that of back-propagation.
arXiv Detail & Related papers (2020-10-12T01:59:18Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
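As referenced in the structured directional pruning entry above, the toy sketch below illustrates the idea shared by that paper and the main paper: prune by moving only along the flat directions of the training loss, so the loss is left essentially unchanged. The quadratic loss, its hand-picked Hessian, and the magnitude-based perturbation are all illustrative assumptions, not either paper's actual procedure, which operates on deep networks and does not diagonalize a Hessian.

```python
import numpy as np

# Toy loss L(w) = 0.5 * (w - w_sgd)^T H (w - w_sgd) around an SGD-like
# solution w_sgd, with a hand-picked Hessian H whose zero eigenvalue
# creates one flat valley direction (all values are illustrative).
H = np.diag([0.0, 4.0, 5.0])
w_sgd = np.array([0.02, 0.03, 0.80])

def loss(w):
    d = w - w_sgd
    return 0.5 * d @ H @ d

# Naive magnitude pruning: zero out every weight with |w| < 0.1.
delta = -w_sgd * (np.abs(w_sgd) < 0.1)

# Directional idea: keep only the component of that perturbation lying in
# the flat (near-zero curvature) eigen-subspace of H.
eigvals, eigvecs = np.linalg.eigh(H)
flat = eigvecs[:, eigvals < 1e-8]          # basis of the flat valley
delta_flat = flat @ (flat.T @ delta)       # orthogonal projection onto it

print(loss(w_sgd + delta))       # naive pruning leaves the valley: loss > 0
print(loss(w_sgd + delta_flat))  # projected pruning stays in it: loss == 0
print(w_sgd + delta_flat)        # only the weight on the flat direction is zeroed
```

The main paper avoids computing these flat directions explicitly: per its abstract, the tuned $\ell_1$ proximal gradient algorithm sketched earlier provably achieves the directional pruning after sufficient training with a small learning rate.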
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.