Implicit Compressibility of Overparametrized Neural Networks Trained
with Heavy-Tailed SGD
- URL: http://arxiv.org/abs/2306.08125v2
- Date: Mon, 12 Feb 2024 10:46:46 GMT
- Title: Implicit Compressibility of Overparametrized Neural Networks Trained
with Heavy-Tailed SGD
- Authors: Yijun Wan, Melih Barsbey, Abdellatif Zaidi, Umut Simsekli
- Abstract summary: We consider a one-hidden-layer neural network trained with stochastic gradient descent (SGD).
We show that if we inject additive heavy-tailed noise into the iterates at each iteration, then for any compression rate there exists a level of overparametrization such that the output of the algorithm will be compressible with high probability.
- Score: 31.61477313262589
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Neural network compression has been an increasingly important subject, not
only due to its practical relevance, but also due to its theoretical
implications, as there is an explicit connection between compressibility and
generalization error. Recent studies have shown that the choice of the
hyperparameters of stochastic gradient descent (SGD) can have an effect on the
compressibility of the learned parameter vector. These results, however, rely
on unverifiable assumptions and the resulting theory does not provide a
practical guideline due to its implicitness. In this study, we propose a simple
modification for SGD, such that the outputs of the algorithm will be provably
compressible without making any nontrivial assumptions. We consider a
one-hidden-layer neural network trained with SGD, and show that if we inject
additive heavy-tailed noise to the iterates at each iteration, for any
compression rate, there exists a level of overparametrization such that the
output of the algorithm will be compressible with high probability. To achieve
this result, we make two main technical contributions: (i) we prove a
'propagation of chaos' result for a class of heavy-tailed stochastic
differential equations, and (ii) we derive error estimates for their Euler
discretization. Our experiments suggest that the proposed approach not only
achieves increased compressibility with various models and datasets, but also
leads to robust test performance under pruning, even in more realistic
architectures that lie beyond our theoretical setting.
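For a concrete picture of the modification described in the abstract, the following is a minimal sketch of an SGD update with additive symmetric alpha-stable noise injected into the iterates at every step. The noise generator (Chambers-Mallows-Stuck), the toy quadratic objective, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable noise (0 < alpha <= 2) with the
    Chambers-Mallows-Stuck method; alpha = 2 recovers a Gaussian."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1.0 / alpha)
            * (np.cos(V - alpha * V) / W) ** ((1.0 - alpha) / alpha))

def noisy_sgd_step(w, grad_fn, lr, alpha, scale, rng):
    """One SGD step with additive heavy-tailed noise on the iterate.
    The lr**(1/alpha) factor mirrors the scaling of an Euler step of an
    alpha-stable-driven SDE; treat it, like all constants here, as an assumption."""
    noise = sample_sas(alpha, w.shape, rng)
    return w - lr * grad_fn(w) + scale * lr ** (1.0 / alpha) * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1_000)      # stand-in for a flattened weight vector
    grad_fn = lambda w: w           # toy quadratic loss; not a neural network
    for _ in range(2_000):
        w = noisy_sgd_step(w, grad_fn, lr=1e-2, alpha=1.8, scale=0.1, rng=rng)
    print("fraction of mass in top 10% of weights:",
          np.sort(np.abs(w))[-100:].sum() / np.abs(w).sum())
```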
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss also allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z) - Reducing Computational Complexity of Neural Networks in Optical Channel
Equalization: From Concepts to Implementation [1.6987798749419218]
We show that it is possible to design an NN-based equalizer that is simpler to implement and has better performance than the conventional digital back-propagation (DBP) equalizer with only one step per span.
An NN-based equalizer can also achieve superior performance while maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block.
arXiv Detail & Related papers (2022-08-26T21:00:05Z) - A Theoretical Understanding of Neural Network Compression from Sparse
Linear Approximation [37.525277809849776]
The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance.
We use the sparsity-sensitive $\ell_q$-norm to characterize compressibility and provide a relationship between the soft sparsity of the weights in the network and the degree of compression.
We also develop adaptive algorithms for pruning each neuron in the network informed by our theory.
arXiv Detail & Related papers (2022-06-11T20:10:35Z) - Federated Random Reshuffling with Compression and Variance Reduction [0.0]
Random Reshuffling (RR) is an immensely popular method for training supervised machine learning models via empirical risk minimization.
It is embedded and often set as default in standard machine learning software.
We introduce three new algorithms to improve FedRR further, including one for taming the variance coming from shuffling and another for taming the variance due to compression.
arXiv Detail & Related papers (2022-05-08T16:46:11Z) - Unified Multivariate Gaussian Mixture for Efficient Neural Image
Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z) - Generalization of Neural Combinatorial Solvers Through the Lens of
Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
arXiv Detail & Related papers (2021-10-21T07:28:11Z) - Heavy Tails in SGD and Compressibility of Overparametrized Neural
Networks [9.554646174100123]
We show that the dynamics of the stochastic gradient descent (SGD) training algorithm has a key role in obtaining compressible networks.
We prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques become arbitrarily small as the network size increases (a minimal pruning-error sketch appears after this list).
arXiv Detail & Related papers (2021-06-07T17:02:59Z) - Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - Consistency Regularization for Certified Robustness of Smoothed
Classifiers [89.72878906950208]
A recent technique of randomized smoothing has shown that the worst-case $\ell_2$-robustness can be transformed into the average-case robustness.
We found that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled by simply regularizing the prediction consistency over noise.
arXiv Detail & Related papers (2020-06-07T06:57:43Z)
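To complement the compressibility claims above, and the '$\ell_p$-compressible' result in 'Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks', here is a minimal sketch of how one might quantify compressibility via the relative error of magnitude pruning. The keep ratio, the norm, and the heavy-tailed stand-in distribution are assumptions for illustration, not any paper's experimental setup.

```python
import numpy as np

def magnitude_prune(w, keep_ratio):
    """Zero all but the largest-magnitude keep_ratio fraction of entries."""
    k = max(1, int(np.ceil(keep_ratio * w.size)))
    threshold = np.partition(np.abs(w), -k)[-k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

def relative_pruning_error(w, keep_ratio, p=2):
    """||w - prune(w)||_p / ||w||_p; the smaller this is, the more
    compressible the vector is at the given keep ratio."""
    pruned = magnitude_prune(w, keep_ratio)
    return np.linalg.norm(w - pruned, ord=p) / np.linalg.norm(w, ord=p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    light = rng.normal(size=10_000)           # light-tailed baseline
    heavy = rng.standard_cauchy(size=10_000)  # heavy-tailed stand-in
    for name, w in [("gaussian", light), ("heavy-tailed", heavy)]:
        print(f"{name}: relative error at 10% kept =",
              relative_pruning_error(w, keep_ratio=0.1))
```

On such synthetic draws, the heavy-tailed vector typically loses far less of its norm under aggressive pruning than the Gaussian one, which is the intuition behind the compressibility results cited above.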