Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks
- URL: http://arxiv.org/abs/2304.02911v2
- Date: Fri, 7 Apr 2023 04:59:08 GMT
- Title: Heavy-Tailed Regularization of Weight Matrices in Deep Neural Networks
- Authors: Xuanzhe Xiao, Zeng Li, Chuanlong Xie, Fengwei Zhou
- Abstract summary: A key finding indicates that the generalization performance of a neural network is associated with the degree of heavy tails in the spectrum of its weight matrices.
We introduce a novel regularization technique, termed Heavy-Tailed Regularization, which explicitly promotes a more heavy-tailed spectrum in the weight matrix through regularization.
We empirically show that heavy-tailed regularization outperforms conventional regularization techniques in terms of generalization performance.
- Score: 8.30897399932868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unraveling the reasons behind the remarkable success and exceptional
generalization capabilities of deep neural networks presents a formidable
challenge. Recent insights from random matrix theory, specifically those
concerning the spectral analysis of weight matrices in deep neural networks,
offer valuable clues to address this issue. A key finding indicates that the
generalization performance of a neural network is associated with the degree of
heavy tails in the spectrum of its weight matrices. To capitalize on this
discovery, we introduce a novel regularization technique, termed Heavy-Tailed
Regularization, which explicitly promotes a more heavy-tailed spectrum in the
weight matrix through regularization. Firstly, we employ the Weighted Alpha and
Stable Rank as penalty terms, both of which are differentiable, enabling the
direct calculation of their gradients. To circumvent over-regularization, we
introduce two variations of the penalty function. Then, adopting a Bayesian
statistics perspective and leveraging knowledge from random matrices, we
develop two novel heavy-tailed regularization methods, utilizing the power-law
distribution and the Fréchet distribution as priors for the global spectrum and
maximum eigenvalues, respectively. We empirically show that heavy-tailed
regularization outperforms conventional regularization techniques in terms of
generalization performance.
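A minimal PyTorch sketch of the penalty-based variant is given below. It is an illustration under stated assumptions rather than the authors' implementation: the power-law exponent alpha is approximated with a Hill-type estimator on the top-k eigenvalues of W^T W, only Linear and Conv2d weights are regularized, and the two over-regularization-avoiding variants of the penalty mentioned above are omitted.

```python
import torch
import torch.nn as nn

def _as_matrix(weight: torch.Tensor) -> torch.Tensor:
    # Reshape conv kernels (out, in, kh, kw) into 2-D matrices.
    return weight.flatten(1) if weight.dim() > 2 else weight

def stable_rank(weight: torch.Tensor) -> torch.Tensor:
    # ||W||_F^2 / ||W||_2^2: small values mean a few singular values
    # dominate the spectrum, i.e. a heavier-tailed layer.
    w = _as_matrix(weight)
    return (w ** 2).sum() / torch.linalg.matrix_norm(w, ord=2) ** 2

def hill_alpha(weight: torch.Tensor, k: int = 20) -> torch.Tensor:
    # Hill-type estimate of the power-law exponent of the spectrum of
    # W^T W, computed from the top-k eigenvalues so it stays differentiable.
    lam = torch.linalg.svdvals(_as_matrix(weight)) ** 2   # descending order
    k = min(k, lam.numel() - 1)
    return 1.0 + k / torch.log(lam[:k] / lam[k]).sum()

def weighted_alpha(weight: torch.Tensor, k: int = 20) -> torch.Tensor:
    # Weighted Alpha for one layer: alpha * log10(lambda_max).
    lam_max = torch.linalg.matrix_norm(_as_matrix(weight), ord=2) ** 2
    return hill_alpha(weight, k) * torch.log10(lam_max)

def heavy_tailed_penalty(model: nn.Module, mode: str = "stable_rank") -> torch.Tensor:
    # Sum the chosen penalty over all Linear/Conv2d weights; minimizing it
    # nudges each layer's spectrum toward heavier tails.
    fn = stable_rank if mode == "stable_rank" else weighted_alpha
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            penalty = penalty + fn(m.weight)
    return penalty

# Usage in a training step (`lam` is the regularization strength):
#   loss = criterion(model(x), y) + lam * heavy_tailed_penalty(model, "weighted_alpha")
```

For the Bayesian variants, a MAP-style penalty can be obtained from the negative log-density of the prior. The snippet below shows this for a Fréchet prior on the largest eigenvalue of W^T W; the shape and scale hyperparameters are placeholders, not values taken from the paper.

```python
def frechet_neglog_prior(weight: torch.Tensor,
                         shape: float = 2.0,
                         scale: float = 1.0) -> torch.Tensor:
    # Negative log-density, up to an additive constant, of a
    # Frechet(shape, scale) prior evaluated at lambda_max(W^T W).
    w = weight.flatten(1) if weight.dim() > 2 else weight
    z = torch.linalg.matrix_norm(w, ord=2) ** 2 / scale
    return (1.0 + shape) * torch.log(z) + z ** (-shape)
```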
Related papers
- Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks [9.948870430491738]
We study the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD).
Our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
arXiv Detail & Related papers (2024-10-03T03:36:18Z)
- Optimization and Generalization Guarantees for Weight Normalization [19.965963460750206]
We provide the first theoretical characterizations of both optimization and generalization of deep WeightNorm models.
We present experimental results which illustrate how the normalization terms and other quantities of theoretical interest relate to the training of WeightNorm networks.
arXiv Detail & Related papers (2024-09-13T15:55:05Z)
- High-Dimensional Kernel Methods under Covariate Shift: Data-Dependent Implicit Regularization [83.06112052443233]
This paper studies kernel ridge regression in high dimensions under covariate shifts.
By a bias-variance decomposition, we theoretically demonstrate that the re-weighting strategy allows for decreasing the variance.
For bias, we analyze regularization with an arbitrary or a well-chosen scale, showing that the bias can behave very differently under different regularization scales.
arXiv Detail & Related papers (2024-06-05T12:03:27Z)
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z)
- Quantum tomography of helicity states for general scattering processes [55.2480439325792]
Quantum tomography has become an indispensable tool for computing the density matrix $\rho$ of quantum systems in physics.
We present the theoretical framework for reconstructing the helicity quantum initial state of a general scattering process.
arXiv Detail & Related papers (2023-10-16T21:23:42Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-the-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Spectral Regularization: an Inductive Bias for Sequence Modeling [7.365884062005811]
This paper presents a spectral regularization technique, which attaches a unique inductive bias to sequence modeling.
From fundamental connections between Hankel matrices and regular grammars, we propose to use the trace norm of the Hankel matrix, the tightest convex relaxation of its rank, as the spectral regularizer.
arXiv Detail & Related papers (2022-11-04T04:07:05Z)
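As a minimal illustration of the trace-norm penalty described in the entry above (not the authors' implementation), a Hankel matrix can be assembled from a model's scalar outputs over a 1-D sequence and penalized by its nuclear norm; the `rows` split is an arbitrary illustrative choice.

```python
import torch

def hankel_trace_norm(seq_values: torch.Tensor, rows: int) -> torch.Tensor:
    # Build H[i, j] = f(i + j) from a 1-D tensor of model outputs and return
    # its nuclear (trace) norm, the tightest convex relaxation of its rank.
    cols = seq_values.numel() - rows + 1
    idx = torch.arange(rows).unsqueeze(1) + torch.arange(cols).unsqueeze(0)
    return torch.linalg.matrix_norm(seq_values[idx], ord="nuc")
```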
- Explicit regularization and implicit bias in deep network classifiers trained with the square loss [2.8935588665357077]
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks.
We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques are used together with Weight Decay.
arXiv Detail & Related papers (2020-12-31T21:07:56Z)
- Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z)
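One quantity that is invariant to such weight scale shifting in networks with homogeneous activations is the product of per-layer norms. The sketch below uses it as a stand-in penalty to make the idea concrete; it is an illustrative assumption, not the regularizer proposed in that paper.

```python
import torch
import torch.nn as nn

def scale_shift_invariant_penalty(model: nn.Module) -> torch.Tensor:
    # Product of per-layer Frobenius norms, accumulated in log space.
    # Rescaling one layer by c and a following layer by 1/c leaves a
    # homogeneous network's function, and this product, unchanged, whereas
    # the value of plain weight decay does change under such a rescaling.
    norms = [torch.linalg.vector_norm(m.weight)   # Frobenius norm of the tensor
             for m in model.modules()
             if isinstance(m, (nn.Linear, nn.Conv2d))]
    return torch.stack(norms).log().sum().exp()
```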
- Beyond Random Matrix Theory for Deep Networks [0.7614628596146599]
We investigate whether the Wigner semicircle and Marchenko-Pastur distributions, often used in theoretical analyses of deep neural networks, match empirically observed spectral densities.
We find that even allowing for outliers, the observed spectral shapes strongly deviate from such theoretical predictions.
We consider two new classes of matrix ensembles: random Wigner/Wishart ensemble products and percolated Wigner/Wishart ensembles, both of which better match observed spectra.
arXiv Detail & Related papers (2020-06-13T21:00:30Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)