A Loss Curvature Perspective on Training Instability in Deep Learning
- URL: http://arxiv.org/abs/2110.04369v1
- Date: Fri, 8 Oct 2021 20:25:48 GMT
- Title: A Loss Curvature Perspective on Training Instability in Deep Learning
- Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam
Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat
- Abstract summary: We study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics.
Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization.
- Score: 28.70491071044542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the evolution of the loss Hessian across many
classification tasks in order to understand the effect the curvature of the
loss has on the training dynamics. Whereas prior work has focused on how
different learning rates affect the loss Hessian observed during training, we
also analyze the effects of model initialization, architectural choices, and
common training heuristics such as gradient clipping and learning rate warmup.
Our results demonstrate that successful model and hyperparameter choices allow
the early optimization trajectory to either avoid -- or navigate out of --
regions of high curvature and into flatter regions that tolerate a higher
learning rate. Our results suggest a unifying perspective on how disparate
mitigation strategies for training instability ultimately address the same
underlying failure mode of neural network optimization, namely poor
conditioning. Inspired by the conditioning perspective, we show that learning
rate warmup can improve training stability just as much as batch normalization,
layer normalization, MetaInit, GradInit, and Fixup initialization.
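The conditioning perspective in the abstract rests on a classical fact: gradient descent on a quadratic loss is stable only while the learning rate stays below 2 / λ_max of the Hessian, and λ_max can be estimated from Hessian-vector products alone via power iteration. The following toy sketch illustrates that threshold; it is not code from the paper, and all names, dimensions, and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 0.5 * w^T H w, built so that lambda_max(H) = 40.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
H = Q @ np.diag(np.linspace(0.1, 40.0, 50)) @ Q.T

def top_eigenvalue(mat, iters=300):
    """Estimate lambda_max by power iteration. On a real network only
    Hessian-vector products are needed, never the full Hessian."""
    v = rng.normal(size=mat.shape[0])
    for _ in range(iters):
        v = mat @ v
        v /= np.linalg.norm(v)
    return float(v @ mat @ v)

def final_norm(lr, steps=500):
    """Run plain gradient descent on the quadratic; return the final ||w||."""
    w = rng.normal(size=H.shape[0])
    for _ in range(steps):
        w = w - lr * (H @ w)
    return float(np.linalg.norm(w))

lam = top_eigenvalue(H)        # close to 40
threshold = 2.0 / lam          # classical stability threshold
stable = final_norm(0.9 * threshold)    # just below: converges
unstable = final_norm(1.1 * threshold)  # just above: diverges
print(f"lambda_max ~= {lam:.1f}; stable run ||w|| = {stable:.3f}, "
      f"unstable run ||w|| = {unstable:.2e}")
```

On a fixed quadratic the threshold never moves; the paper's point is that on real networks the curvature itself evolves, so warmup and good initialization let the trajectory reach flatter regions where a larger learning rate clears this bound.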
Related papers
- Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple re-parameterization which we call Normalize-and-Project.
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLMs).
By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness.
We propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp).
arXiv Detail & Related papers (2023-09-14T17:13:54Z)
- Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation [34.7523496790944]
We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering the training dynamics.
We show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original budget for training.
arXiv Detail & Related papers (2023-06-22T07:06:45Z)
- Continual Learning with Pretrained Backbones by Tuning in the Input Space [44.97953547553997]
The intrinsic difficulty in adapting deep learning models to non-stationary environments limits the applicability of neural networks to real-world tasks.
We propose a novel strategy to make the fine-tuning procedure more effective: we avoid updating the pre-trained part of the network and learn not only the usual classification head, but also a set of newly introduced learnable parameters.
arXiv Detail & Related papers (2023-06-05T15:11:59Z)
- On the Loss Landscape of Adversarial Training: Identifying Challenges and How to Overcome Them [57.957466608543676]
We analyze the influence of adversarial training on the loss landscape of machine learning models.
We show that the adversarial loss landscape is less favorable to optimization, due to increased curvature and more scattered gradients.
arXiv Detail & Related papers (2020-06-15T13:50:23Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
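The equivalence noted in the first related paper above, growth in the parameter norm acting as decay in the effective learning rate, can be checked on a scale-invariant toy loss. Everything below is an illustrative sketch under that assumption, not code from any of the listed papers: for a loss that depends only on the direction of w (as with a layer followed by normalization), the gradient obeys ||∇f(c·w)|| = ||∇f(w)|| / c, so a fixed rate η behaves like an effective rate η / ||w||² on the normalized weights.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=16)
t /= np.linalg.norm(t)  # fixed target direction

def f(w):
    """A scale-invariant toy loss: depends only on the direction of w,
    mimicking a weight layer followed by a normalization layer."""
    u = w / np.linalg.norm(w)
    return 0.5 * np.sum((u - t) ** 2)

def grad(w, eps=1e-6):
    """Central-difference numerical gradient, to keep the sketch free
    of any autodiff machinery."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = rng.normal(size=16)
g1, g2 = grad(w), grad(3.0 * w)

# Tripling the weight norm cuts the gradient norm to one third, i.e. the
# effective learning rate on the normalized weights shrinks as ||w|| grows.
ratio = np.linalg.norm(g1) / np.linalg.norm(g2)
print(f"||grad(w)|| / ||grad(3w)|| = {ratio:.4f}")  # close to 3.0
```

This is the side effect the Normalize-and-Project paper makes explicit: without intervention, unchecked norm growth silently anneals the learning rate schedule.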
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.