The large learning rate phase of deep learning: the catapult mechanism
- URL: http://arxiv.org/abs/2003.02218v1
- Date: Wed, 4 Mar 2020 17:52:48 GMT
- Title: The large learning rate phase of deep learning: the catapult mechanism
- Authors: Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari
- Abstract summary: We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
- Score: 50.23041928811575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The choice of initial learning rate can have a profound effect on the
performance of deep networks. We present a class of neural networks with
solvable training dynamics, and confirm their predictions empirically in
practical deep learning settings. The networks exhibit sharply distinct
behaviors at small and large learning rates. The two regimes are separated by a
phase transition. In the small learning rate phase, training can be understood
using the existing theory of infinitely wide neural networks. At large learning
rates the model captures qualitatively distinct phenomena, including the
convergence of gradient descent dynamics to flatter minima. One key prediction
of our model is a narrow range of large, stable learning rates. We find good
agreement between our model's predictions and training dynamics in realistic
deep learning settings. Furthermore, we find that the optimal performance in
such settings is often found in the large learning rate phase. We believe our
results shed light on characteristics of models trained at different learning
rates. In particular, they fill a gap between existing wide neural network
theory, and the nonlinear, large learning rate, training dynamics relevant to
practice.
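The abstract does not spell out the solvable model, so the following is only a minimal sketch under stated assumptions. It assumes a two-layer linear network f = v·u/√n of width n, trained by full-batch gradient descent on a single example with input 1 and label 0 under squared loss L = f²/2. For that toy setup the update reduces to a closed recursion in the output f and the kernel value λ = (|u|² + |v|²)/n. The function name `simulate`, the initial values f0 = 1 and λ0 = 2, the width of 1000, and the blow-up threshold are all illustrative choices, not values taken from the listing above. In this recursion, |f| grows at first when ηλ > 2, and λ shrinks whenever ηλ < 4, which gives a window of large learning rates where the loss spikes and then settles at a smaller λ (a flatter solution).

```python
import math


def simulate(eta, f0=1.0, lam0=2.0, width=1000, steps=200, blowup=1e6):
    """Iterate the (f, lam) recursion of the assumed toy model.

    f   : network output on the single training example (label is 0)
    lam : kernel value (|u|^2 + |v|^2) / width, a proxy for curvature
    """
    f, lam = f0, lam0
    losses = [0.5 * f * f]
    for _ in range(steps):
        f_sq_over_n = f * f / width
        # Simultaneous update: both right-hand sides use the old (f, lam).
        f, lam = (
            f * (1.0 - eta * lam + eta * eta * f_sq_over_n),
            lam + eta * f_sq_over_n * (eta * lam - 4.0),
        )
        losses.append(0.5 * f * f)
        # Stop early once the output has clearly blown up.
        if not math.isfinite(f) or abs(f) > blowup:
            return {"diverged": True, "losses": losses, "lam": lam}
    return {"diverged": False, "losses": losses, "lam": lam}


if __name__ == "__main__":
    # With lam0 = 2 the thresholds of this toy model are 2/lam0 = 1 and 4/lam0 = 2.
    for name, eta in [("small", 0.25), ("large", 1.5), ("too large", 2.5)]:
        out = simulate(eta)
        print(
            f"{name:9s} eta={eta:4.2f} diverged={out['diverged']} "
            f"peak_loss={max(out['losses']):.3g} "
            f"final_loss={out['losses'][-1]:.3g} final_lam={out['lam']:.3g}"
        )
```

Running the script compares three learning rates: one below 2/λ0, one between 2/λ0 and 4/λ0, and one above 4/λ0. Iterating the scalar recursion rather than the weight vectors keeps the sketch deterministic and makes the two thresholds easy to read off the update rule.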
Related papers
- Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network.
Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
- Unified Neural Network Scaling Laws and Scale-time Equivalence [10.918504301310753]
We present a novel theoretical characterization of how three factors -- model size, training time, and data volume -- interact to determine the performance of deep neural networks.
We first establish a theoretical and empirical equivalence between scaling the size of a neural network and increasing its training time proportionally.
We then combine scale-time equivalence with a linear model analysis of double descent to obtain a unified theoretical scaling law.
arXiv Detail & Related papers (2024-09-09T16:45:26Z)
- Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron [3.069335774032178]
We use a dataset-process approach to derive flow equations describing learning.
We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve.
This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
arXiv Detail & Related papers (2024-09-05T17:58:28Z)
- A simple theory for training response of deep neural networks [0.0]
Deep neural networks give us a powerful method to model the training dataset's relationship between input and output.
We show the training response consists of some different factors based on training stages, activation functions, or training methods.
In addition, we show feature space reduction as an effect of training dynamics, which can result in network fragility.
arXiv Detail & Related papers (2024-05-07T07:20:15Z)
- Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning [77.82908213345864]
We find empirical evidence that learning rate transfer can be attributed to the fact that under $\mu$P and its depth extension, the largest eigenvalue of the training loss Hessian is largely independent of the width and depth of the network.
We show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer.
arXiv Detail & Related papers (2024-02-27T12:28:01Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- How to Train Your Neural Network: A Comparative Evaluation [1.3654846342364304]
We discuss and compare current state-of-the-art frameworks for large scale distributed deep learning.
We present empirical results comparing their performance on large image and language training tasks.
Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.
arXiv Detail & Related papers (2021-11-09T04:24:42Z)
- Deep Active Learning by Leveraging Training Dynamics [57.95155565319465]
We propose a theory-driven deep active learning method (dynamicAL) which selects samples to maximize training dynamics.
We show that dynamicAL not only outperforms other baselines consistently but also scales well on large deep learning models.
arXiv Detail & Related papers (2021-10-16T16:51:05Z)
- Sparse Meta Networks for Sequential Adaptation and its Application to Adaptive Language Modelling [7.859988850911321]
We introduce Sparse Meta Networks -- a meta-learning approach to learn online sequential adaptation algorithms for deep neural networks.
We augment a deep neural network with a layer-specific fast-weight memory.
We demonstrate strong performance on a variety of sequential adaptation scenarios.
arXiv Detail & Related papers (2020-09-03T17:06:52Z)