Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
- URL: http://arxiv.org/abs/2010.03533v2
- Date: Wed, 16 Mar 2022 00:14:20 GMT
- Title: Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
- Authors: Utku Evci, Yani A. Ioannou, Cem Keskin, Yann Dauphin
- Abstract summary: Sparse NNs can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training.
In this paper we show that naively training unstructured sparse NNs from random initialization results in significantly worse generalization.
We also show that Lottery Tickets (LTs) do not improve gradient flow, rather their success lies in re-learning the pruning solution they are derived from.
- Score: 8.700592446069395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Neural Networks (NNs) can match the generalization of dense NNs using
a fraction of the compute/storage for inference, and also have the potential to
enable efficient training. However, naively training unstructured sparse NNs
from random initialization results in significantly worse generalization, with
the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training
(DST). Through our analysis of gradient flow during training we attempt to
answer: (1) why training unstructured sparse networks from random
initialization performs poorly; and (2) what makes LTs and DST the exceptions?
We show that sparse NNs have poor gradient flow at initialization and
demonstrate the importance of using sparsity-aware initialization. Furthermore,
we find that DST methods significantly improve gradient flow during training
over traditional sparse training methods. Finally, we show that LTs do not
improve gradient flow, rather their success lies in re-learning the pruning
solution they are derived from - however, this comes at the cost of learning
novel solutions.
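
To make the abstract's two central claims concrete (that unstructured sparsity degrades gradient flow at initialization, and that a sparsity-aware initialization can help restore it), the sketch below rescales a masked linear layer's weights by each unit's nonzero fan-in and compares gradient norms against a standard dense-fan-in initialization. This is a minimal illustration under assumed settings: the layer sizes, the random 90% mask, and the helper names `dense_kaiming_init_` and `sparse_aware_init_` are illustrative choices, not the authors' exact procedure.

```python
# Minimal sketch (assumptions, not the paper's method): compare gradient flow
# through a ~90%-sparse linear layer under a dense-fan-in initialization versus
# a sparsity-aware initialization scaled by each unit's nonzero fan-in.
import torch
import torch.nn as nn


def dense_kaiming_init_(weight: torch.Tensor) -> None:
    # Standard He initialization: variance scaled by the *dense* fan-in.
    nn.init.kaiming_normal_(weight, nonlinearity="relu")


def sparse_aware_init_(weight: torch.Tensor, mask: torch.Tensor) -> None:
    # Hypothetical sparsity-aware variant: scale each output unit's weights by
    # its surviving (nonzero) fan-in so pre-activation variance is preserved.
    fan_in = mask.sum(dim=1).clamp(min=1.0)          # nonzero inputs per output unit
    std = (2.0 / fan_in).sqrt().unsqueeze(1)         # He-style scale, per row
    with torch.no_grad():
        weight.normal_(0.0, 1.0).mul_(std).mul_(mask)


def grad_norm_after_one_batch(use_sparse_aware: bool) -> float:
    torch.manual_seed(0)
    layer = nn.Linear(512, 512, bias=False)
    mask = (torch.rand_like(layer.weight) < 0.1).float()   # ~90% unstructured sparsity
    if use_sparse_aware:
        sparse_aware_init_(layer.weight, mask)
    else:
        dense_kaiming_init_(layer.weight)
        with torch.no_grad():
            layer.weight.mul_(mask)                         # apply the mask after dense init
    x = torch.randn(64, 512)
    loss = layer(x).relu().pow(2).mean()                    # arbitrary loss, just to get gradients
    loss.backward()
    # Gradient-flow proxy: norm of the gradient restricted to surviving weights.
    return (layer.weight.grad * mask).norm().item()


print("dense-scaled init  :", grad_norm_after_one_batch(False))
print("sparsity-aware init:", grad_norm_after_one_batch(True))
```

The gradient norm used here is only a crude proxy for the gradient-flow analysis in the paper; the sketch conveys the fan-in-correction idea rather than the exact initialization the authors propose.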
Related papers
- Approximation and Gradient Descent Training with Neural Networks [0.0]
Recent work extends a neural tangent kernel (NTK) optimization argument to an under-parametrized regime.
This paper establishes analogous results for networks trained by gradient descent.
arXiv Detail & Related papers (2024-05-19T23:04:09Z)
- Dynamic Sparsity Is Channel-Level Sparsity Learner [91.31071026340746]
Dynamic sparse training (DST) is a leading sparse training approach.
Channel-aware dynamic sparse (Chase) seamlessly translates the promise of unstructured dynamic sparsity into channel-level sparsity.
arXiv Detail & Related papers (2023-05-30T23:33:45Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Exact Gradient Computation for Spiking Neural Networks Through Forward Propagation [39.33537954568678]
Spiking neural networks (SNN) have emerged as alternatives to traditional neural networks.
We propose a novel training algorithm, called forward propagation (FP), that computes exact gradients for SNNs.
arXiv Detail & Related papers (2022-10-18T20:28:21Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Online Training Through Time for Spiking Neural Networks [66.7744060103562]
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models.
Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency.
We propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning.
arXiv Detail & Related papers (2022-10-09T07:47:56Z)
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- Selfish Sparse RNN Training [13.165729746380816]
We propose an approach to train sparse RNNs with a fixed parameter count in one single run, without compromising performance.
We achieve state-of-the-art sparse training results on the Penn TreeBank and Wikitext-2 datasets.
arXiv Detail & Related papers (2021-01-22T10:45:40Z)
- Fractional moment-preserving initialization schemes for training deep neural networks [1.14219428942199]
A traditional approach to initializing deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of pre-activations.
In this paper, we show that weights and therefore pre-activations can be modeled with a heavy-tailed distribution.
We show through numerical experiments that our schemes can improve the training and test performance.
arXiv Detail & Related papers (2020-05-25T01:10:01Z)
- Robust Pruning at Initialization [61.30574156442608]
There is a growing need for smaller, energy-efficient neural networks that can bring machine learning applications to devices with limited computational resources.
For deep NNs, existing pruning-at-initialization procedures remain unsatisfactory, as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned.
arXiv Detail & Related papers (2020-02-19T17:09:50Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based training combined with nonconvexity renders learning susceptible to novel problems.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.