Towards strong pruning for lottery tickets with non-zero biases
- URL: http://arxiv.org/abs/2110.11150v1
- Date: Thu, 21 Oct 2021 13:56:04 GMT
- Title: Towards strong pruning for lottery tickets with non-zero biases
- Authors: Jonas Fischer, Rebekka Burkholz
- Abstract summary: The strong lottery ticket hypothesis holds the promise that pruning randomly initialized deep neural networks could offer an efficient alternative to deep learning.
Common parameter initialization schemes and existence proofs, however, are focused on networks with zero biases.
We extend these schemes and existence proofs to non-zero biases, including explicit 'looks-linear' approaches for ReLU activation functions.
- Score: 6.85316573653194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The strong lottery ticket hypothesis holds the promise that pruning randomly
initialized deep neural networks could offer a computationally efficient
alternative to deep learning with stochastic gradient descent. Common parameter
initialization schemes and existence proofs, however, are focused on networks
with zero biases, thus foregoing the potential universal approximation property
of pruning. To fill this gap, we extend multiple initialization schemes and
existence proofs to non-zero biases, including explicit 'looks-linear'
approaches for ReLU activation functions. These do not only enable truly
orthogonal parameter initialization but also reduce potential pruning errors.
In experiments on standard benchmark data sets, we further highlight the
practical benefits of non-zero bias initialization schemes, and present
theoretically inspired extensions for state-of-the-art strong lottery ticket
pruning.
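The 'looks-linear' idea mentioned in the abstract can be made concrete with a small sketch. The following is a minimal NumPy illustration, not the authors' code: it assumes the classic mirrored construction relu(z) - relu(-z) = z together with He-style scaling, and the `bias_scale` knob and helper names are illustrative assumptions standing in for the paper's non-zero-bias variant.

```python
# Minimal sketch (assumptions, not the paper's code) of a "looks-linear"
# initialization for a ReLU layer: mirrored weight blocks make the layer act
# linearly at initialization, since relu(z) - relu(-z) = z.
import numpy as np

rng = np.random.default_rng(0)

def looks_linear_layer(fan_in, fan_out, bias_scale=0.0):
    """Return (W, b) with mirrored blocks; bias_scale > 0 illustrates the kind
    of non-zero-bias variant the paper studies (exact scheme is an assumption)."""
    W0 = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    W = np.concatenate([W0, -W0], axis=0)   # mirrored weight block
    b0 = bias_scale * rng.normal(size=fan_out)
    b = np.concatenate([b0, -b0])           # mirrored biases
    return W, b

def forward_looks_linear(x, W, b, V0):
    """One mirrored ReLU layer, recombined with signs so that the overall map
    is exactly linear when bias_scale == 0."""
    h = np.maximum(W @ x + b, 0.0)          # ReLU pre-activations, width 2*fan_out
    fan_out = V0.shape[1]
    return V0 @ (h[:fan_out] - h[fan_out:]) # relu(z) - relu(-z) = z

x = rng.normal(size=8)
W, b = looks_linear_layer(fan_in=8, fan_out=16, bias_scale=0.0)
V0 = rng.normal(0.0, np.sqrt(1.0 / 16), size=(4, 16))
out = forward_looks_linear(x, W, b, V0)
# With zero biases the network reproduces the purely linear map V0 @ (W0 @ x):
assert np.allclose(out, V0 @ (W[:16] @ x))
```

With zero biases the construction is exactly linear at initialization; the paper's contribution is to extend such schemes and the corresponding existence proofs to the non-zero-bias case.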
Related papers
- Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations [17.51364577113718]
We show three principles for architecture selection in finite width graph neural networks (GNNs) with ReLU activations.
First, we theoretically derive what is essentially the unique generalization to ReLU GNNs of the well-known He-initialization.
Second, we prove in finite width vanilla ReLU GNNs that oversmoothing is unavoidable at large depth when using a fixed aggregation operator.
arXiv Detail & Related papers (2023-06-20T16:40:41Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Dynamical Isometry for Residual Networks [8.21292084298669]
We show that our initialization scheme, RISOTTO, achieves perfect dynamical isometry for residual networks with ReLU activation functions even for finite depth and width.
In experiments, we demonstrate that our approach outperforms schemes proposed to make Batch Normalization obsolete, including Fixup and SkipInit.
arXiv Detail & Related papers (2022-10-05T17:33:23Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions (a worked form of this decomposition is sketched after this list).
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable semi-implicit variational inference (SIVI) method.
Our method optimizes a rigorous lower bound on the evidence using low-variance gradient estimates.
arXiv Detail & Related papers (2021-01-15T11:39:09Z)
- Implicit Regularization in ReLU Networks with the Square Loss [56.70360094597169]
We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
arXiv Detail & Related papers (2020-12-09T16:48:03Z)
- What needles do sparse neural networks find in nonlinear haystacks [0.0]
A sparsity-inducing penalty in artificial neural networks (ANNs) avoids over-fitting, especially in situations where noise is high and the training set is small.
For linear models, such an approach provably also recovers the important features with high probability for a well-chosen penalty parameter.
We perform a set of comprehensive Monte Carlo simulations on a simple model, and the numerical results show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-06-07T04:46:55Z)
- Fractional moment-preserving initialization schemes for training deep neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z)
- Optimistic Exploration even with a Pessimistic Initialisation [57.41327865257504]
Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL).
In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values.
We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network.
arXiv Detail & Related papers (2020-02-26T17:15:53Z)
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization [56.69671152009899]
We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization.
We also propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction.
arXiv Detail & Related papers (2020-02-20T15:13:27Z)
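For the 'Predicting Unreliable Predictions by Shattering a Neural Network' entry above, a minimal sketch of what writing the empirical error as an expectation over subfunctions can look like; the region notation D_s and the pointwise error err are illustrative assumptions, not necessarily that paper's exact notation:

```latex
% Each ReLU activation pattern s defines a linear subfunction valid on its
% input region D_s; the network's error then decomposes by total expectation.
\mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathrm{err}(x)\right]
  = \sum_{s} \Pr_{x \sim \mathcal{D}}\!\left[x \in D_s\right]\,
    \mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathrm{err}(x) \mid x \in D_s\right]
```

Each conditional term is a subfunction's own empirical error, weighted by how often inputs fall inside its domain.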