Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
- URL: http://arxiv.org/abs/2305.17212v4
- Date: Mon, 3 Jun 2024 15:57:47 GMT
- Title: Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
- Authors: Atli Kosson, Bettina Messmer, Martin Jaggi
- Abstract summary: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks.
We show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation -- a proxy for the effective learning rate -- across different layers and neurons. Our work analyzes these dynamics across optimizers like Adam, Lion, and SGD with momentum, offering a new simple perspective on training that elucidates the efficacy of widely used but poorly understood methods in deep learning. We demonstrate how balanced rotation plays a key role in the effectiveness of normalization like Weight Standardization, as well as that of AdamW over Adam with L2-regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
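To make the notion of rotation in the abstract concrete, below is a minimal sketch (not the authors' code) of how the per-step angular update of each neuron's weight vector can be tracked during training. It assumes PyTorch; the toy MLP, random data, and hyperparameters are placeholders chosen purely for illustration, and the equilibrium value quoted in the closing comment is the approximation commonly derived for scale-invariant (normalized) weights under plain SGD, not a result verified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: a small MLP on random data (placeholder, not the paper's setup).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(512, 32), torch.randint(0, 10, (512,))

def mean_rotation(prev, curr):
    # Angle (radians) between each neuron's old and new weight vector,
    # i.e. between corresponding rows of the weight matrix, averaged per layer.
    cos = F.cosine_similarity(prev, curr, dim=1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean().item()

for step in range(500):
    # Snapshot the 2D weight matrices before the update.
    prev = {n: p.detach().clone() for n, p in model.named_parameters() if p.ndim == 2}
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 100 == 0:
        for name, p in model.named_parameters():
            if p.ndim == 2:
                rot = mean_rotation(prev[name], p.detach())
                print(f"step {step:3d}  {name}: mean rotation {rot:.4f} rad")

# If a rotational equilibrium is reached, the per-layer mean rotation should
# stabilize and become similar across layers. For scale-invariant (normalized)
# weights trained with plain SGD, related analyses predict an equilibrium
# rotation of roughly sqrt(2 * lr * weight_decay) per step.
```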
Related papers
- To update or not to update? Neurons at equilibrium in deep models [8.72305226979945]
Recent advances in deep learning showed that, with some a-posteriori information on fully-trained models, it is possible to match the same performance by simply training a subset of their parameters.
In this work we shift the focus from individual parameters to the behavior of the whole neuron, exploiting the concept of neuronal equilibrium (NEq).
The proposed approach has been tested on different state-of-the-art learning strategies and tasks, validating NEq and observing that the neuronal equilibrium depends on the specific learning setup.
arXiv Detail & Related papers (2022-07-19T08:07:53Z)
- Improving Deep Neural Network Random Initialization Through Neuronal Rewiring [14.484787903053208]
We show that a higher neuronal strength variance may decrease performance, while a lower neuronal strength variance usually improves it.
A new method is then proposed to rewire neuronal connections according to a preferential attachment (PA) rule based on their strength.
In this sense, PA only reorganizes connections, while preserving the magnitude and distribution of the weights.
arXiv Detail & Related papers (2022-07-17T11:52:52Z)
- SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network [8.79431718760617]
Training with mini-batch SGD and weight decay induces a bias toward rank minimization in weight matrices.
We show that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay.
We empirically explore the connection between this bias and generalization, finding that it has only a marginal effect on test performance. (A simple numerical probe for this rank bias is sketched after this list.)
arXiv Detail & Related papers (2022-06-12T17:06:35Z)
- Minimizing Control for Credit Assignment with Strong Feedback [65.59995261310529]
Current methods for gradient-based credit assignment in deep neural networks need infinitesimally small feedback signals.
We combine strong feedback influences on neural activity with gradient-based learning and show that this naturally leads to a novel view on neural network optimization.
We show that the use of strong feedback in Deep Feedback Control (DFC) allows learning forward and feedback connections simultaneously, using a learning rule that is fully local in space and time.
arXiv Detail & Related papers (2022-04-14T22:06:21Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable, resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- Self-organized criticality in neural networks [0.0]
We show that learning dynamics of neural networks is generically attracted towards a self-organized critical state.
Our results support the claim that the universe might be a neural network.
arXiv Detail & Related papers (2021-07-07T18:00:03Z)
- Formalizing Generalization and Robustness of Neural Networks to Weight Perturbations [58.731070632586594]
We provide the first formal analysis for feed-forward neural networks with non-negative monotone activation functions against weight perturbations.
We also design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations.
arXiv Detail & Related papers (2021-03-03T06:17:03Z)
- Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we recover a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z)
- Spherical Motion Dynamics: Learning Dynamics of Neural Network with Normalization, Weight Decay, and SGD [105.99301967452334]
We characterize the learning dynamics of neural networks trained with normalization, weight decay (WD), and SGD with momentum, which we name Spherical Motion Dynamics (SMD).
We verify our assumptions and theoretical results on various computer vision tasks including ImageNet and MSCOCO with standard settings.
arXiv Detail & Related papers (2020-06-15T14:16:33Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
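Following up on the rank-minimization entry above, here is a small, self-contained probe (my own illustration, not taken from that paper) for the low-rank bias of weight matrices. It assumes NumPy and uses two standard surrogates, the stable rank and a threshold-based effective rank; the threshold value is an arbitrary choice.

```python
import numpy as np

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth, scale-invariant surrogate for rank.
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

def effective_rank(W, tol=1e-3):
    # Number of singular values above a small fraction of the largest one.
    s = np.linalg.svd(W, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Usage: compare these numbers for a layer's weight matrix at initialization
# and after training with weight decay; a drop indicates the low-rank bias.
W = np.random.randn(256, 128)   # stand-in for a trained weight matrix
print(stable_rank(W), effective_rank(W))
```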