The Equilibrium Hypothesis: Rethinking implicit regularization in Deep
Neural Networks
- URL: http://arxiv.org/abs/2110.11749v1
- Date: Fri, 22 Oct 2021 12:49:31 GMT
- Title: The Equilibrium Hypothesis: Rethinking implicit regularization in Deep
Neural Networks
- Authors: Yizhang Lou, Chris Mingard, Soufiane Hayou
- Abstract summary: Modern Deep Neural Networks (DNNs) exhibit impressive generalization properties on a variety of tasks without explicit regularization.
Recent work by Baratin et al. (2021) sheds light on an intriguing implicit regularization effect, showing that some layers are much more aligned with data labels than other layers.
This suggests that as the network grows in depth and width, an implicit layer selection phenomenon occurs during training.
- Score: 1.7188280334580197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern Deep Neural Networks (DNNs) exhibit impressive generalization
properties on a variety of tasks without explicit regularization, suggesting
the existence of hidden regularization effects. Recent work by Baratin et al.
(2021) sheds light on an intriguing implicit regularization effect, showing
that some layers are much more aligned with data labels than other layers. This
suggests that as the network grows in depth and width, an implicit layer
selection phenomenon occurs during training. In this work, we provide the first
explanation for this alignment hierarchy. We introduce and empirically validate
the Equilibrium Hypothesis which states that the layers that achieve some
balance between forward and backward information loss are the ones with the
highest alignment to data labels. Our experiments demonstrate an excellent
match with the theoretical predictions.
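As a rough illustration of the alignment measure the abstract refers to: Baratin et al. (2021) quantify how strongly a layer's representation is aligned with the data labels via a kernel alignment score. The sketch below uses a centered kernel alignment between each layer's features and one-hot labels as a simple proxy; this is an assumption for illustration, not the paper's exact tangent-kernel metric.

```python
import numpy as np

def centered_gram(x):
    """Centered linear Gram matrix of an (n_samples, n_features) array."""
    k = x @ x.T
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return h @ k @ h

def alignment(features, labels_onehot):
    """Centered kernel alignment between layer features and labels (in [0, 1])."""
    kf = centered_gram(features)
    ky = centered_gram(labels_onehot)
    return np.sum(kf * ky) / (np.linalg.norm(kf) * np.linalg.norm(ky))

# Toy example: random "layer activations" for 3 layers and random labels.
rng = np.random.default_rng(0)
n, n_classes = 128, 10
y = np.eye(n_classes)[rng.integers(0, n_classes, size=n)]
layer_features = {f"layer_{i}": rng.normal(size=(n, 64)) for i in range(3)}

for name, feats in layer_features.items():
    print(name, round(alignment(feats, y), 3))
```

On real data, `features` would be the activations of each hidden layer on a held-out batch; the Equilibrium Hypothesis predicts that the highest scores appear at the layers that balance forward and backward information loss.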
Related papers
- Neural Rank Collapse: Weight Decay and Small Within-Class Variability
Yield Low-Rank Bias [4.829265670567825]
We show the presence of an intriguing neural rank collapse phenomenon, connecting the low-rank bias of trained networks with networks' neural collapse properties.
As the weight decay parameter grows, the rank of each layer in the network decreases proportionally to the within-class variability of the hidden-space embeddings of the previous layers.
arXiv Detail & Related papers (2024-02-06T13:44:39Z)
- Deep Neural Networks Tend To Extrapolate Predictably [51.303814412294514]
Conventional wisdom suggests that neural network predictions are unpredictable and overconfident when faced with out-of-distribution (OOD) inputs.
We observe that neural network predictions often tend towards a constant value as input data becomes increasingly OOD.
We show how one can leverage our insights in practice to enable risk-sensitive decision-making in the presence of OOD inputs.
arXiv Detail & Related papers (2023-10-02T03:25:32Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in this model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
- What Does the Gradient Tell When Attacking the Graph Structure [44.44204591087092]
We present a theoretical demonstration revealing that attackers tend to increase inter-class edges due to the message passing mechanism of GNNs.
By connecting dissimilar nodes, attackers can more effectively corrupt node features, making such attacks more advantageous.
We propose an innovative attack loss that balances attack effectiveness and imperceptibility, sacrificing some attack effectiveness to attain greater imperceptibility.
arXiv Detail & Related papers (2022-08-26T15:45:20Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- The Low-Rank Simplicity Bias in Deep Networks [46.79964271742486]
We make a series of empirical observations that investigate and extend the hypothesis that deep networks are inductively biased to find solutions with lower effective rank embeddings.
We show that this claim holds for finite-width linear and non-linear models across practical learning paradigms, and that on natural data these are often the solutions that generalize well (a minimal effective-rank sketch appears after this list).
arXiv Detail & Related papers (2021-03-18T17:58:02Z)
- A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization [16.98526336526696]
We study the layerwise loss landscape through the eigenspectra of the Hessian at each layer.
In particular, our results show that the layerwise Hessian geometry is largely similar to the entire Hessian.
We propose a new regularizer based on the trace of the layerwise Hessian.
arXiv Detail & Related papers (2020-12-07T15:42:44Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- AL2: Progressive Activation Loss for Learning General Representations in Classification Neural Networks [12.14537824884951]
We propose a novel regularization method that progressively penalizes the magnitude of activations during training.
Our method's effect on generalization is analyzed with label randomization tests and cumulative ablations.
arXiv Detail & Related papers (2020-03-07T18:38:46Z)
- Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers.
We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set.
We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z)
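Several entries above (the low-rank simplicity bias and neural rank collapse papers) hinge on the effective rank of weight or embedding matrices. The snippet below is a minimal sketch of one common effective-rank measure, the exponential of the entropy of the normalized singular values; the exact rank notion used in each paper may differ.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Effective rank: exp of the entropy of the normalized singular value distribution."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

# Toy comparison: a dense random matrix vs. an exactly rank-5 product.
rng = np.random.default_rng(0)
w_full = rng.normal(size=(256, 256))
w_low = rng.normal(size=(256, 5)) @ rng.normal(size=(5, 256))

print("random weights:  ", round(effective_rank(w_full), 1))
print("low-rank product:", round(effective_rank(w_low), 1))
```

Tracking this quantity per layer over the course of training is one simple way to observe a low-rank bias empirically.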
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all of its content) and is not responsible for any consequences of its use.