Implicit Compressibility of Overparametrized Neural Networks Trained
with Heavy-Tailed SGD
- URL: http://arxiv.org/abs/2306.08125v2
- Date: Mon, 12 Feb 2024 10:46:46 GMT
- Title: Implicit Compressibility of Overparametrized Neural Networks Trained
with Heavy-Tailed SGD
- Authors: Yijun Wan, Melih Barsbey, Abdellatif Zaidi, Umut Simsekli
- Abstract summary: We consider a one-hidden-layer neural network trained with stochastic gradient descent (SGD).
We show that if we inject additive heavy-tailed noise into the iterates at each iteration, then for any compression rate there exists a level of overparametrization such that the output of the algorithm will be compressible with high probability.
- Score: 31.61477313262589
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Neural network compression has been an increasingly important subject, not
only due to its practical relevance, but also due to its theoretical
implications, as there is an explicit connection between compressibility and
generalization error. Recent studies have shown that the choice of the
hyperparameters of stochastic gradient descent (SGD) can have an effect on the
compressibility of the learned parameter vector. These results, however, rely
on unverifiable assumptions and the resulting theory does not provide a
practical guideline due to its implicitness. In this study, we propose a simple
modification for SGD, such that the outputs of the algorithm will be provably
compressible without making any nontrivial assumptions. We consider a
one-hidden-layer neural network trained with SGD, and show that if we inject
additive heavy-tailed noise to the iterates at each iteration, for any
compression rate, there exists a level of overparametrization such that the
output of the algorithm will be compressible with high probability. To achieve
this result, we make two main technical contributions: (i) we prove a
'propagation of chaos' result for a class of heavy-tailed stochastic
differential equations, and (ii) we derive error estimates for their Euler
discretization. Our experiments suggest that the proposed approach not only
achieves increased compressibility with various models and datasets, but also
leads to robust test performance under pruning, even in more realistic
architectures that lie beyond our theoretical setting.
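For a concrete picture of the modification described in the abstract, the following is a minimal sketch of an SGD update with additive symmetric alpha-stable noise injected into the iterates at every step. The noise generator (Chambers-Mallows-Stuck), the toy quadratic objective, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable noise (0 < alpha <= 2) with the
    Chambers-Mallows-Stuck method; alpha = 2 recovers a Gaussian."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1.0 / alpha)
            * (np.cos(V - alpha * V) / W) ** ((1.0 - alpha) / alpha))

def noisy_sgd_step(w, grad_fn, lr, alpha, scale, rng):
    """One SGD step with additive heavy-tailed noise on the iterate.
    The lr**(1/alpha) factor mirrors the scaling of an Euler step of an
    alpha-stable-driven SDE; treat it, like all constants here, as an assumption."""
    noise = sample_sas(alpha, w.shape, rng)
    return w - lr * grad_fn(w) + scale * lr ** (1.0 / alpha) * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1_000)      # stand-in for a flattened weight vector
    grad_fn = lambda w: w           # toy quadratic loss; not a neural network
    for _ in range(2_000):
        w = noisy_sgd_step(w, grad_fn, lr=1e-2, alpha=1.8, scale=0.1, rng=rng)
    print("fraction of mass in top 10% of weights:",
          np.sort(np.abs(w))[-100:].sum() / np.abs(w).sum())
```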
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss also allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z) - Reducing Computational Complexity of Neural Networks in Optical Channel
Equalization: From Concepts to Implementation [1.6987798749419218]
We show that it is possible to design an NN-based equalizer that is simpler to implement and has better performance than the conventional digital back-propagation (DBP) equalizer with only one step per span.
An NN-based equalizer can also achieve superior performance while maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block.
arXiv Detail & Related papers (2022-08-26T21:00:05Z) - A Theoretical Understanding of Neural Network Compression from Sparse
Linear Approximation [37.525277809849776]
The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance.
We use the sparsity-sensitive $\ell_q$-norm to characterize compressibility and provide a relationship between the soft sparsity of the weights in the network and the degree of compression.
We also develop adaptive algorithms for pruning each neuron in the network informed by our theory.
arXiv Detail & Related papers (2022-06-11T20:10:35Z) - Federated Random Reshuffling with Compression and Variance Reduction [0.0]
Random Reshuffling (RR) is an immensely popular method for training supervised machine learning models via empirical risk minimization.
It is embedded and often set as default in standard machine learning software.
We introduce three new algorithms to improve FedRR further, including one for taming the variance coming from shuffling and another for taming the variance due to compression.
arXiv Detail & Related papers (2022-05-08T16:46:11Z) - Unified Multivariate Gaussian Mixture for Efficient Neural Image
Compression [151.3826781154146]
Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression.
We find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective.
Our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up.
arXiv Detail & Related papers (2022-03-21T11:44:17Z) - Generalization of Neural Combinatorial Solvers Through the Lens of
Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
arXiv Detail & Related papers (2021-10-21T07:28:11Z) - Heavy Tails in SGD and Compressibility of Overparametrized Neural
Networks [9.554646174100123]
We show that the dynamics of the stochastic gradient descent (SGD) training algorithm has a key role in obtaining compressible networks.
We prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques become arbitrarily small as the network size increases (a minimal pruning-error sketch appears after this list).
arXiv Detail & Related papers (2021-06-07T17:02:59Z) - Linear Convergent Decentralized Optimization with Compression [50.44269451541387]
Existing decentralized algorithms with compression mainly focus on compressing DGD-type algorithms.
Motivated by primal-dual algorithms, this paper proposes the first LinEAr convergent Decentralized algorithm with compression, LEAD.
arXiv Detail & Related papers (2020-07-01T04:35:00Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - Consistency Regularization for Certified Robustness of Smoothed
Classifiers [89.72878906950208]
A recent technique of randomized smoothing has shown that the worst-case $\ell_2$-robustness can be transformed into the average-case robustness.
We found that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled by simply regularizing the prediction consistency over noise.
arXiv Detail & Related papers (2020-06-07T06:57:43Z)
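To complement the compressibility claims above, and the '$\ell_p$-compressible' result in 'Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks', here is a minimal sketch of how one might quantify compressibility via the relative error of magnitude pruning. The keep ratio, the norm, and the heavy-tailed stand-in distribution are assumptions for illustration, not any paper's experimental setup.

```python
import numpy as np

def magnitude_prune(w, keep_ratio):
    """Zero all but the largest-magnitude keep_ratio fraction of entries."""
    k = max(1, int(np.ceil(keep_ratio * w.size)))
    threshold = np.partition(np.abs(w), -k)[-k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

def relative_pruning_error(w, keep_ratio, p=2):
    """||w - prune(w)||_p / ||w||_p; the smaller this is, the more
    compressible the vector is at the given keep ratio."""
    pruned = magnitude_prune(w, keep_ratio)
    return np.linalg.norm(w - pruned, ord=p) / np.linalg.norm(w, ord=p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    light = rng.normal(size=10_000)           # light-tailed baseline
    heavy = rng.standard_cauchy(size=10_000)  # heavy-tailed stand-in
    for name, w in [("gaussian", light), ("heavy-tailed", heavy)]:
        print(f"{name}: relative error at 10% kept =",
              relative_pruning_error(w, keep_ratio=0.1))
```

On such synthetic draws, the heavy-tailed vector typically loses far less of its norm under aggressive pruning than the Gaussian one, which is the intuition behind the compressibility results cited above.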