Heavy Tails in SGD and Compressibility of Overparametrized Neural
Networks
- URL: http://arxiv.org/abs/2106.03795v1
- Date: Mon, 7 Jun 2021 17:02:59 GMT
- Title: Heavy Tails in SGD and Compressibility of Overparametrized Neural
Networks
- Authors: Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gaël Richard, Umut Şimşekli
- Abstract summary: We show that the dynamics of the gradient descent training algorithm has a key role in obtaining compressible networks.
We prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques become arbitrarily small as the network size increases.
- Score: 9.554646174100123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network compression techniques have become increasingly popular as
they can drastically reduce the storage and computation requirements for very
large networks. Recent empirical studies have illustrated that even simple
pruning strategies can be surprisingly effective, and several theoretical
studies have shown that compressible networks (in specific senses) should
achieve a low generalization error. Yet, a theoretical characterization of the
underlying cause that makes the networks amenable to such simple compression
schemes is still missing. In this study, we address this fundamental question
and reveal that the dynamics of the training algorithm has a key role in
obtaining such compressible networks. Focusing our attention on stochastic
gradient descent (SGD), our main contribution is to link compressibility to two
recently established properties of SGD: (i) as the network size goes to
infinity, the system can converge to a mean-field limit, where the network
weights behave independently, (ii) for a large step-size/batch-size ratio, the
SGD iterates can converge to a heavy-tailed stationary distribution. In the
case where these two phenomena occur simultaneously, we prove that the networks
are guaranteed to be '$\ell_p$-compressible', and the compression errors of
different pruning techniques (magnitude, singular value, or node pruning)
become arbitrarily small as the network size increases. We further prove
generalization bounds adapted to our theoretical framework, which indeed
confirm that the generalization error will be lower for more compressible
networks. Our theory and numerical study on various neural networks show that
large step-size/batch-size ratios introduce heavy-tails, which, in combination
with overparametrization, result in compressibility.
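As a quick illustration of this claim (a minimal numerical sketch under the i.i.d.-weights idealization, not the paper's code or experiments), the snippet below compares magnitude pruning on light-tailed Gaussian weights with heavy-tailed Student-t weights: as the dimension grows, the relative $\ell_2$ pruning error shrinks in the heavy-tailed case while staying roughly constant in the Gaussian case.

```python
# Minimal sketch: heavy-tailed i.i.d. weights are far more amenable to
# magnitude pruning than Gaussian ones. Keeping only the largest 10% of
# coordinates loses a vanishing fraction of the l2 mass as the dimension
# grows when the tails are heavy (illustration only, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def relative_pruning_error(w, keep_ratio=0.1):
    """Relative l2 error of magnitude pruning that keeps `keep_ratio` of the entries."""
    k = max(1, int(keep_ratio * w.size))
    smallest = np.argsort(np.abs(w))[:-k]    # indices of the entries to zero out
    pruned = w.copy()
    pruned[smallest] = 0.0
    return np.linalg.norm(w - pruned) / np.linalg.norm(w)

for d in (10_000, 100_000, 1_000_000):
    gaussian = rng.standard_normal(d)        # light tails: mass spread over all entries
    heavy = rng.standard_t(df=1.5, size=d)   # heavy tails: mass concentrated in a few entries
    print(f"d={d:>9}  Gaussian err={relative_pruning_error(gaussian):.3f}  "
          f"heavy-tailed err={relative_pruning_error(heavy):.3f}")
```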
Related papers
- Spike-and-slab shrinkage priors for structurally sparse Bayesian neural networks [0.16385815610837165]
Sparse deep learning addresses the storage and computation challenges of large networks by recovering a sparse representation of the underlying target function.
Deep neural architectures compressed via structured sparsity provide low latency inference, higher data throughput, and reduced energy consumption.
We propose structurally sparse Bayesian neural networks which prune excessive nodes with (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS) priors.
arXiv Detail & Related papers (2023-08-17T17:14:18Z) - A Theoretical Understanding of Neural Network Compression from Sparse
Linear Approximation [37.525277809849776]
The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance.
We use a sparsity-sensitive $\ell_q$-norm to characterize compressibility and provide a relationship between the soft sparsity of the weights in the network and the degree of compression (a numerical sketch of this kind of $\ell_q$ bound appears after this list).
We also develop adaptive algorithms for pruning each neuron in the network, informed by our theory.
arXiv Detail & Related papers (2022-06-11T20:10:35Z) - Fast Conditional Network Compression Using Bayesian HyperNetworks [54.06346724244786]
We introduce a conditional compression problem and propose a fast framework for tackling it.
The problem is how to quickly compress a pretrained large neural network into optimal smaller networks given target contexts.
Our methods can quickly generate compressed networks with significantly smaller sizes than baseline methods.
arXiv Detail & Related papers (2022-05-13T00:28:35Z) - On the Compression of Neural Networks Using $\ell_0$-Norm Regularization
and Weight Pruning [0.9821874476902968]
The present paper is dedicated to the development of a novel compression scheme for neural networks.
A new form of regularization is first developed, which induces strong sparseness in the network during training.
The proposed compression scheme also involves $\ell$-norm regularization to avoid overfitting, as well as fine-tuning to improve the performance of the pruned network (a generic sketch of this regularize-prune-fine-tune pipeline appears after this list).
arXiv Detail & Related papers (2021-09-10T19:19:42Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Attribution Preservation in Network Compression for Reliable Network
Interpretation [81.84564694303397]
Neural networks embedded in safety-sensitive applications rely on input attribution for hindsight analysis and on network compression to reduce their size for edge computing.
We show that these seemingly unrelated techniques conflict with each other as network compression deforms the produced attributions.
This phenomenon arises due to the fact that conventional network compression methods only preserve the predictions of the network while ignoring the quality of the attributions.
arXiv Detail & Related papers (2020-10-28T16:02:31Z) - ESPN: Extremely Sparse Pruned Networks [50.436905934791035]
We show that a simple iterative mask discovery method can achieve state-of-the-art compression of very deep networks.
Our algorithm represents a hybrid approach between single shot network pruning methods and Lottery-Ticket type approaches.
arXiv Detail & Related papers (2020-06-28T23:09:27Z) - Compact Neural Representation Using Attentive Network Pruning [1.0152838128195465]
We describe a Top-Down attention mechanism that is added to a Bottom-Up feedforward network to select important connections and subsequently prune redundant ones at all parametric layers.
Our method not only introduces a novel hierarchical selection mechanism as the basis of pruning but also remains competitive with previous baseline methods in the experimental evaluation.
arXiv Detail & Related papers (2020-05-10T03:20:01Z) - Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent and easily-measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
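The sparsity-sensitive $\ell_q$-norm in the sparse-linear-approximation entry above can be made concrete with a small numerical check. The sketch below verifies the standard best-$k$-term approximation bound $\|w - w_k\|_2 \le k^{1/2-1/q}\|w\|_q$ for $0 < q < 2$, where $w_k$ keeps the $k$ largest-magnitude entries; this is an illustration of the general principle, not the exact quantity or algorithm used in that paper.

```python
# Hedged sketch: the l_q quasi-norm (0 < q < 2) of a weight vector controls how
# well it is approximated by its k largest-magnitude entries, via
#   ||w - w_k||_2 <= k^(1/2 - 1/q) * ||w||_q.
# Illustration only; not the cited paper's measure or pruning algorithm.
import numpy as np

rng = np.random.default_rng(1)

def best_k_error(w, k):
    """l2 error of keeping only the k largest-magnitude entries of w."""
    dropped = np.sort(np.abs(w))[:-k]        # magnitudes of the zeroed-out entries
    return float(np.sqrt(np.sum(dropped ** 2)))

def lq_quasinorm(w, q):
    return float(np.sum(np.abs(w) ** q) ** (1.0 / q))

w = rng.standard_t(df=1.5, size=50_000)      # heavy-tailed weights: small l_q mass for q < 2
q = 1.0
for k in (50, 500, 5_000):
    err = best_k_error(w, k)
    bound = k ** (0.5 - 1.0 / q) * lq_quasinorm(w, q)
    print(f"k={k:>5}  pruning error={err:10.2f}  <=  l_{q:.1f} bound={bound:10.2f}")
```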
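The $\ell_0$-norm regularization entry above describes a generic regularize-prune-fine-tune pipeline. The sketch below is a minimal stand-in on a toy linear model: an $\ell_1$ penalty solved by proximal gradient descent (ISTA) plays the role of the sparsity-inducing regularizer, followed by magnitude pruning and least-squares fine-tuning on the surviving support. The regularizer, model, and thresholds are assumptions for illustration, not that paper's method.

```python
# Hedged sketch of a regularize -> prune -> fine-tune pipeline on a toy
# sparse linear-regression problem (illustration only, not the cited method).
import numpy as np

rng = np.random.default_rng(2)
n, d, k_true = 200, 100, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k_true] = 3.0 * rng.standard_normal(k_true)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def ista(X, y, lam, weight_decay=1e-3, steps=2000):
    """Proximal gradient descent on 0.5*||Xw - y||^2 + weight_decay*||w||^2 + lam*||w||_1."""
    lr = 1.0 / (np.linalg.norm(X, 2) ** 2 + 2 * weight_decay)   # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + 2 * weight_decay * w
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-thresholding prox
    return w

w = ista(X, y, lam=5.0)                  # sparsity-inducing training
mask = np.abs(w) > 1e-3                  # magnitude pruning
w_pruned = np.where(mask, w, 0.0)
w_ft = np.zeros(d)                       # fine-tune only the surviving weights
w_ft[mask] = np.linalg.lstsq(X[:, mask], y, rcond=None)[0]
print(f"kept weights: {mask.sum()} / {d}")
print(f"training residual before/after fine-tuning: "
      f"{np.linalg.norm(X @ w_pruned - y):.3f} / {np.linalg.norm(X @ w_ft - y):.3f}")
```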