The Break-Even Point on Optimization Trajectories of Deep Neural
Networks
- URL: http://arxiv.org/abs/2002.09572v1
- Date: Fri, 21 Feb 2020 22:55:51 GMT
- Title: The Break-Even Point on Optimization Trajectories of Deep Neural
Networks
- Authors: Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit,
Jacek Tabor, Kyunghyun Cho, Krzysztof Geras
- Abstract summary: We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
- Score: 64.7563588124004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The early phase of training of deep neural networks is critical for their
final performance. In this work, we study how the hyperparameters of stochastic
gradient descent (SGD) used in the early phase of training affect the rest of
the optimization trajectory. We argue for the existence of the "break-even"
point on this trajectory, beyond which the curvature of the loss surface and
noise in the gradient are implicitly regularized by SGD. In particular, we
demonstrate on multiple classification tasks that using a large learning rate
in the initial phase of training reduces the variance of the gradient, and
improves the conditioning of the covariance of gradients. These effects are
beneficial from the optimization perspective and become visible after the
break-even point. Complementing prior work, we also show that using a low
learning rate results in bad conditioning of the loss surface even for a neural
network with batch normalization layers. In short, our work shows that key
properties of the loss surface are strongly influenced by SGD in the early
phase of training. We argue that studying the impact of the identified effects
on generalization is a promising future direction.
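The two quantities highlighted in the abstract, the variance of the mini-batch gradient and the conditioning of the covariance of gradients, can be probed numerically. The sketch below is an illustration only, not the authors' experimental protocol: it trains a small toy classifier with a small and a large learning rate and then estimates both statistics from a sample of mini-batch gradients; the model, data, and hyperparameters are placeholder choices.

```python
# Illustrative sketch (not the paper's exact setup): after a short "early
# phase" of SGD with a small vs. a large learning rate, estimate
#   (1) the total variance of the mini-batch gradient, and
#   (2) the condition number of the empirical gradient covariance,
# the two quantities the abstract says are implicitly regularized by SGD.
import torch
import torch.nn as nn

torch.manual_seed(0)

def flat_grad(model, loss):
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_noise_stats(model, loss_fn, X, y, batch_size=32, n_batches=64):
    """Sample mini-batch gradients; return (total variance, condition number
    of the empirical gradient covariance restricted to its nonzero spectrum)."""
    gs = []
    for _ in range(n_batches):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        gs.append(flat_grad(model, loss_fn(model(X[idx]), y[idx])))
    G = torch.stack(gs)                        # (n_batches, n_params)
    G = G - G.mean(dim=0, keepdim=True)        # center the gradient sample
    total_var = G.pow(2).sum(dim=1).mean().item()   # trace of the covariance
    # Nonzero eigenvalues of the covariance via the small Gram matrix G G^T.
    eig = torch.linalg.eigvalsh(G @ G.T / n_batches)
    eig = eig[eig > 1e-10]
    return total_var, (eig.max() / eig.min()).item()

# Toy data and model, purely for illustration.
X = torch.randn(2048, 20)
y = (X[:, 0] > 0).long()
loss_fn = nn.CrossEntropyLoss()

for lr in (0.01, 0.5):                         # "low" vs. "high" learning rate
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(200):                       # early phase of training
        idx = torch.randint(0, X.shape[0], (32,))
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()
    var, cond = gradient_noise_stats(model, loss_fn, X, y)
    print(f"lr={lr}: gradient variance={var:.4f}, "
          f"gradient covariance condition number={cond:.1f}")
```

Under the break-even-point hypothesis, the run with the larger early-phase learning rate would be expected to show lower gradient variance and a better-conditioned gradient covariance once training has passed the break-even point; the toy problem here only illustrates how the measurements can be made.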
Related papers
- Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks [15.691263438655842]
Spiking Neural Network (SNN) is a biologically inspired neural network infrastructure that has recently garnered significant attention.
Training an SNN directly poses a challenge due to the undefined gradient of the firing spike process.
We propose a shortcut back-propagation method that transmits the gradient directly from the loss to the shallow layers.
arXiv Detail & Related papers (2024-01-09T10:54:41Z)
- Inference and Interference: The Role of Clipping, Pruning and Loss Landscapes in Differentially Private Stochastic Gradient Descent [13.27004430044574]
Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks.
We compare the behavior of the two processes separately in early and late epochs.
We find that while DP-SGD makes slower progress in early stages, it is the behavior in the later stages that determines the end result.
arXiv Detail & Related papers (2023-11-12T13:31:35Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective at solving forward and inverse differential equation problems.
PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Early Stage Convergence and Global Convergence of Training Mildly Parameterized Neural Networks [3.148524502470734]
We show that the loss decreases rapidly and by a significant amount in the early stage of training.
We use a microscopic analysis of the activation patterns of the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z)
- A Loss Curvature Perspective on Training Instability in Deep Learning [28.70491071044542]
We study the evolution of the loss Hessian across many classification tasks in order to understand the effect that the curvature of the loss has on the training dynamics.
Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization.
arXiv Detail & Related papers (2021-10-08T20:25:48Z)
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability [94.4070247697549]
Full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability.
In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2/\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. (A minimal numerical sketch of this threshold appears after this list.)
arXiv Detail & Related papers (2021-02-26T22:08:19Z)
- A Flatter Loss for Bias Mitigation in Cross-dataset Facial Age Estimation [37.107335288543624]
We advocate a cross-dataset protocol for age estimation benchmarking.
We propose a novel loss function that is more effective for neural network training.
arXiv Detail & Related papers (2020-10-20T15:22:29Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
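The Edge of Stability entry above refers to the top eigenvalue (sharpness) of the training-loss Hessian and the $2/\text{(step size)}$ threshold, and the same Hessian-vector products give the curvature tracked in the loss-curvature and initialization papers. The sketch below is a minimal illustration under assumed placeholder choices (toy model, toy data, learning rate, iteration counts), not the setup of any paper above: it estimates the top Hessian eigenvalue by power iteration and compares it with 2/lr during full-batch training.

```python
# Illustrative sketch: estimate the top eigenvalue (sharpness) of the
# training-loss Hessian via power iteration on Hessian-vector products and
# compare it with the 2/(step size) threshold. Model, data, learning rate,
# and iteration counts are placeholder assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

def top_hessian_eigenvalue(loss, params, iters=50):
    """Power iteration using autograd Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = torch.cat([p.reshape(-1) for p in grads])
    v = torch.randn_like(g)
    v /= v.norm()
    eigval = 0.0
    for _ in range(iters):
        hv = torch.autograd.grad(g @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eigval = (v @ hv).item()               # Rayleigh quotient v^T H v
        v = hv / (hv.norm() + 1e-12)
    return eigval

# Toy full-batch problem, purely for illustration.
X = torch.randn(512, 10)
y = (X[:, 0] > 0).long()
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))
params = list(model.parameters())
loss_fn = nn.CrossEntropyLoss()
lr = 0.05

opt = torch.optim.SGD(params, lr=lr)
for step in range(501):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        sharpness = top_hessian_eigenvalue(loss_fn(model(X), y), params)
        print(f"step {step}: top Hessian eigenvalue ~ {sharpness:.2f}, "
              f"2/(step size) = {2 / lr:.2f}")
```

The same `top_hessian_eigenvalue` routine can be reused to track how curvature evolves under learning rate warmup or different initializations, which is the kind of measurement several of the entries above rely on.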
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.