Phase diagram of early training dynamics in deep neural networks: effect
of the learning rate, depth, and width
- URL: http://arxiv.org/abs/2302.12250v2
- Date: Tue, 24 Oct 2023 17:59:46 GMT
- Title: Phase diagram of early training dynamics in deep neural networks: effect
of the learning rate, depth, and width
- Authors: Dayal Singh Kalra and Maissam Barkeshli
- Abstract summary: We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD).
We find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We systematically analyze optimization dynamics in deep neural networks
(DNNs) trained with stochastic gradient descent (SGD) and study the effect of
learning rate $\eta$, depth $d$, and width $w$ of the neural network. By
analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss,
which is a measure of sharpness of the loss landscape, we find that the
dynamics can show four distinct regimes: (i) an early time transient regime,
(ii) an intermediate saturation regime, (iii) a progressive sharpening regime,
and (iv) a late time ``edge of stability'' regime. The early and intermediate
regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c /
\lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which
separate qualitatively distinct phenomena in the early time dynamics of
training loss and sharpness. Notably, we discover the opening up of a
``sharpness reduction" phase, where sharpness decreases at early times, as $d$
and $1/w$ are increased.
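As a concrete illustration of the parameterization $\eta \equiv c / \lambda_0^H$, the following is a minimal sketch (not the authors' code) of estimating the sharpness $\lambda_0^H$ at initialization by power iteration on Hessian-vector products and then setting the SGD learning rate accordingly; the small MLP, the random data, and the value $c = 2.0$ are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): estimate the sharpness lambda_0^H,
# the top Hessian eigenvalue of the loss at initialization, via power
# iteration on Hessian-vector products, then set eta = c / lambda_0^H.
import torch
import torch.nn as nn

def top_hessian_eigenvalue(model, loss_fn, x, y, iters=50):
    """Power iteration on Hessian-vector products (autograd only)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit start vector with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))  # Rayleigh quotient v^T H v
        norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / norm for hvi in hv]
    return eig.item()

# Hypothetical usage: a small MLP and the c-parameterized learning rate.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
lam0 = top_hessian_eigenvalue(model, nn.CrossEntropyLoss(), x, y)
c = 2.0                       # assumed value; the paper sweeps c over critical values
eta = c / lam0                # eta = c / lambda_0^H
opt = torch.optim.SGD(model.parameters(), lr=eta)
```

Re-running the same estimator on $\lambda^H_t$ during training, while sweeping $c$ and the network's depth and width, is one way to trace out the kind of phase diagram described in the abstract.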
Related papers
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective [66.80315289020487]
The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget.
We show that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom.
Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch.
arXiv Detail & Related papers (2024-10-07T16:49:39Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - Universal Sharpness Dynamics in Neural Network Training: Fixed Point
Analysis, Edge of Stability, and Route to Chaos [6.579523168465526]
In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training.
We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios (a minimal numerical sketch of this setup appears after this list).
arXiv Detail & Related papers (2023-11-03T17:59:40Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Understanding Edge-of-Stability Training Dynamics with a Minimalist
Example [20.714857891192345]
Recently, researchers observed that gradient descent for deep neural networks operates in an ``edge-of-stability'' (EoS) regime.
We give rigorous analysis for its dynamics in a large local region and explain why the final converging point has sharpness close to $2/\eta$.
arXiv Detail & Related papers (2022-10-07T02:57:05Z) - Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge
of Stability [8.492339290649031]
This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory.
We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics.
We provide a theoretical proof of the sharpness behavior in the EoS regime in two-layer fully-connected linear neural networks.
arXiv Detail & Related papers (2022-07-26T06:37:58Z) - Differentially private training of neural networks with Langevin
dynamics for calibrated predictive uncertainty [58.730520380312676]
We show that differentially private stochastic gradient descent (DP-SGD) can yield poorly calibrated, overconfident deep learning models.
This represents a serious issue for safety-critical applications, e.g. in medical diagnosis.
arXiv Detail & Related papers (2021-07-09T08:14:45Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in randomly initialized deep neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
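As a concrete illustration of the sharpness phenomenology referenced in the UV-model and edge-of-stability entries above, here is a minimal numerical sketch (not taken from any of the listed papers) of gradient descent on a scalar two-layer linear model fit to a single training example, tracking the top Hessian eigenvalue against the $2/\eta$ stability threshold; the data values, initialization, learning rate, and step count are assumptions chosen for illustration.

```python
# Minimal illustrative sketch: gradient descent on the scalar "UV model"
# f(x) = v*u*x fit to a single example (x, y), tracking the sharpness
# (top Hessian eigenvalue of the loss) against the 2/eta threshold.
import numpy as np

x, y = 1.0, 2.0          # single training example (assumed values)
u, v = 0.1, 0.15         # small initialization (assumed)
eta = 0.5                # learning rate (assumed), so 2/eta = 4

for step in range(200):
    r = u * v * x - y                    # residual of L = 0.5 * (u*v*x - y)^2
    gu, gv = r * v * x, r * u * x        # gradients dL/du, dL/dv
    # Exact 2x2 Hessian of L in (u, v):
    # [[ (v*x)^2,         r*x + u*v*x^2 ],
    #  [ r*x + u*v*x^2,   (u*x)^2       ]]
    H = np.array([[(v * x) ** 2, r * x + u * v * x ** 2],
                  [r * x + u * v * x ** 2, (u * x) ** 2]])
    sharpness = np.linalg.eigvalsh(H).max()
    if step % 40 == 0:
        print(f"step {step:3d}  loss {0.5 * r**2:.4f}  "
              f"sharpness {sharpness:.3f}  2/eta {2 / eta:.3f}")
    u, v = u - eta * gu, v - eta * gv    # gradient descent update
```

With these particular settings the loss drops quickly and the measured sharpness ends up hovering around $2/\eta$ rather than settling well below it, mirroring the edge-of-stability behavior described in the entries above.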