What to Prune and What Not to Prune at Initialization
- URL: http://arxiv.org/abs/2209.02201v1
- Date: Tue, 6 Sep 2022 03:48:10 GMT
- Title: What to Prune and What Not to Prune at Initialization
- Authors: Maham Haroon
- Abstract summary: Post-training dropout-based approaches achieve high sparsity.
Pruning at initialization is more effective at scaling down the computational cost of the network.
The goal is to achieve higher sparsity while preserving performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training dropout-based approaches achieve high sparsity and are a
well-established means of addressing computational cost and overfitting in
neural network architectures. By contrast, pruning at initialization is still
far behind. Pruning at initialization is more effective at scaling down the
computational cost of the network, and it handles overfitting just as well as
post-training dropout.
For these reasons, the paper presents two approaches to pruning at
initialization. The goal is to achieve higher sparsity while preserving
performance. 1) K-starts begins with k random p-sparse matrices at
initialization. During the first couple of epochs the network then determines
the "fittest" of these p-sparse matrices in an attempt to find the "lottery
ticket" p-sparse network. The approach is adapted from how evolutionary
algorithms select the best individual. Depending on the neural network
architecture, the fitness criterion can be based on the magnitude of the
network weights, the magnitude of the gradient accumulated over an epoch, or a
combination of both. 2) The dissipating-gradients approach aims to eliminate
weights that remain within a fraction of their initial value during the first
couple of epochs. Removing weights in this manner, regardless of their
magnitude, best preserves the performance of the network; however, this
approach also takes the most epochs to reach higher sparsity. 3) A combination
of dissipating gradients and k-starts consistently outperforms either method
alone, as well as random dropout. (Both approaches are sketched in code after
the abstract.)
The benefits of the proposed pruning approaches are: 1) they do not require
specific knowledge of the classification task, nor the fixing of a dropout
threshold or regularization parameters; 2) retraining of the model is neither
necessary nor does it affect the performance of the p-sparse network.
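As a rough illustration of the k-starts idea described in the abstract, the sketch below samples k random p-sparse masks and scores each one by a blend of the kept weights' magnitudes and the magnitudes of gradients accumulated over an epoch. This is only an interpretation of the abstract, not the paper's exact procedure: the function names, the blending factor alpha, and the use of a single set of weights and gradient sums to rank all candidates are assumptions.

```python
# Hypothetical sketch of the k-starts idea: sample k random p-sparse masks,
# then keep the "fittest" one according to a magnitude/gradient criterion.
# The fitness blend (alpha) and evaluation protocol are assumptions drawn
# only from the abstract.
import numpy as np

def random_p_sparse_masks(shape, p, k, rng):
    """Return k binary masks that each keep a fraction (1 - p) of the weights."""
    masks = []
    n = int(np.prod(shape))
    keep = int(round((1.0 - p) * n))
    for _ in range(k):
        idx = rng.choice(n, size=keep, replace=False)
        m = np.zeros(n, dtype=bool)
        m[idx] = True
        masks.append(m.reshape(shape))
    return masks

def fitness(mask, weights, grad_accum, alpha=0.5):
    """Score a mask by the magnitude of the weights and of the accumulated
    gradients it keeps; alpha blends the two criteria (an assumption)."""
    w_score = np.abs(weights[mask]).sum()
    g_score = np.abs(grad_accum[mask]).sum()
    return alpha * w_score + (1.0 - alpha) * g_score

def select_fittest(masks, weights, grad_accum, alpha=0.5):
    scores = [fitness(m, weights, grad_accum, alpha) for m in masks]
    return masks[int(np.argmax(scores))]

# Usage with stand-in arrays: in practice `weights` and `grad_accum` would
# come from the layer being pruned after the first couple of epochs.
rng = np.random.default_rng(0)
weights = rng.normal(size=(128, 64))
grad_accum = rng.normal(size=(128, 64))   # stand-in for per-epoch gradient sums
masks = random_p_sparse_masks(weights.shape, p=0.9, k=8, rng=rng)
best = select_fittest(masks, weights, grad_accum)
print(best.mean())  # ~0.1, i.e. roughly 90% sparsity
```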
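Similarly, a minimal sketch of the dissipating-gradients criterion, again inferred from the abstract only: after the first couple of epochs, weights whose values have moved by less than a fraction tau of their initial magnitude are pruned, regardless of how large they are. The threshold tau, the relative-change test, and the epsilon guard are illustrative assumptions.

```python
# Hypothetical sketch of the dissipating-gradients criterion: prune weights
# that stayed within a fraction `tau` of their initial value. `tau` and the
# relative-change test are assumptions, not the paper's exact rule.
import numpy as np

def dissipating_gradients_mask(w_init, w_now, tau=0.05, eps=1e-12):
    """Keep weights that changed by more than tau * |initial value|."""
    rel_change = np.abs(w_now - w_init) / (np.abs(w_init) + eps)
    return rel_change > tau   # True = keep, False = prune

# Usage with stand-in arrays: in practice w_init is saved at initialization
# and w_now is read after the first couple of epochs of training.
rng = np.random.default_rng(1)
w_init = rng.normal(size=(256, 128))
w_now = w_init + rng.normal(scale=0.02, size=w_init.shape)
mask = dissipating_gradients_mask(w_init, w_now, tau=0.5)
print("sparsity:", 1.0 - mask.mean())
```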
Related papers
- Learning effective pruning at initialization from iterative pruning [15.842658282636876]
We present an end-to-end neural network-based PaI method to reduce training costs.
Our approach outperforms existing methods in high-sparsity settings.
As the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach.
arXiv Detail & Related papers (2024-08-27T03:17:52Z)
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure with residual connections around nonlinear network sections, which allows information to keep flowing through the network once a nonlinear section is pruned.
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization [49.06421851486415]
Static sparse training aims to train sparse models from scratch, achieving remarkable results in recent years.
We propose Exact Orthogonal Initialization (EOI), a novel sparse Orthogonal Initialization scheme based on random Givens rotations.
Our method enables training highly sparse 1000-layer MLP and CNN networks without residual connections or normalization techniques.
arXiv Detail & Related papers (2024-06-03T19:44:47Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Robust Learning of Parsimonious Deep Neural Networks [0.0]
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network.
We derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection.
We evaluate the proposed algorithm on the MNIST data set and commonly used fully connected and convolutional LeNet architectures.
arXiv Detail & Related papers (2022-05-10T03:38:55Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Rapid Structural Pruning of Neural Networks with Set-based Task-Adaptive Meta-Pruning [83.59005356327103]
A common limitation of most existing pruning techniques is that they require pre-training of the network at least once before pruning.
We propose STAMP, which task-adaptively prunes a network pretrained on a large reference dataset by generating a pruning mask on it as a function of the target dataset.
We validate STAMP against recent advanced pruning methods on benchmark datasets.
arXiv Detail & Related papers (2020-06-22T10:57:43Z)
- Pruning neural networks without any data by iteratively conserving synaptic flow [27.849332212178847]
Pruning the parameters of deep neural networks has generated intense interest due to potential savings in time, memory and energy.
Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks.
We provide an affirmative answer to this question through theory-driven algorithm design.
arXiv Detail & Related papers (2020-06-09T19:21:57Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based optimization combined with the nonconvexity of the learning problem renders training susceptible to initialization.
We propose initializing deeper networks by fusing neighboring layers of shallower networks trained with random initialization.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)