Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate
- URL: http://arxiv.org/abs/2010.02916v1
- Date: Tue, 6 Oct 2020 17:58:29 GMT
- Title: Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate
- Authors: Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora
- Abstract summary: Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which behavior of normalized nets departs from traditional viewpoints.
We formulate the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective.
- Score: 36.83448475700536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular
normalization schemes (including Batch Normalization) in today's deep learning
can move it far from a traditional optimization viewpoint, e.g., use of
exponentially increasing learning rates. The current paper highlights other
ways in which behavior of normalized nets departs from traditional viewpoints,
and then initiates a formal framework for studying their mathematics via a
suitable adaptation of the conventional framework, namely, modeling the
SGD-induced training trajectory via a suitable stochastic differential
equation (SDE) with a noise term that captures gradient noise. This yields:
(a) A new 'intrinsic learning rate' parameter that is the product of the
normal learning rate and the weight decay factor. Analysis of the SDE shows
how the effective speed of learning varies and equilibrates over time under
the control of the intrinsic LR.
(b) A challenge -- via theory and experiments -- to popular belief that good
generalization requires large learning rates at the start of training. (c) New
experiments, backed by mathematical intuition, suggesting the number of steps
to equilibrium (in function space) scales as the inverse of the intrinsic
learning rate, as opposed to the exponential time convergence bound implied by
SDE analysis. We name it the Fast Equilibrium Conjecture and suggest it holds
the key to why Batch Normalization is effective.
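To make the abstract's key quantity concrete, here is a minimal sketch (not the authors' code) of SGD with weight decay on a toy scale-invariant objective. The toy gradient, dimensions, and hyperparameter values are illustrative assumptions; only the relation intrinsic LR = learning rate × weight decay factor comes from the abstract.

```python
import numpy as np

# Minimal sketch (not the authors' code) of the "intrinsic learning rate":
# the product of the ordinary learning rate and the weight decay factor.
# For a scale-invariant ("normalized") parameter, the loss depends only on
# the direction w/||w||, so the effective step size on that direction
# behaves like eta/||w||^2, while lambda_i = eta * wd governs how the norm,
# and hence the effective speed of learning, equilibrates over training.

rng = np.random.default_rng(0)

eta, wd = 0.1, 5e-4            # ordinary learning rate and weight decay
intrinsic_lr = eta * wd        # intrinsic LR, lambda_i = eta * wd

def toy_scale_invariant_grad(w):
    # Toy gradient of a loss L(w) = f(w / ||w||): orthogonal to w and
    # shrinking like 1/||w|| (illustrative only, not a real network).
    g = rng.normal(size=w.shape)
    g -= w * (w @ g) / (w @ w)         # remove the radial component
    return g / np.linalg.norm(w)

w = rng.normal(size=10)
for step in range(20_000):
    g = toy_scale_invariant_grad(w)
    w = (1.0 - eta * wd) * w - eta * g  # SGD step with weight decay
    if step % 5_000 == 0:
        print(step, np.linalg.norm(w), eta / np.linalg.norm(w) ** 2)
```

In this toy run, ||w|| drifts toward an equilibrium value and the effective step size eta/||w||^2 equilibrates with it, at a pace set by eta * wd rather than by either hyperparameter alone.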
Related papers
- Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple re-parameterization which we call Normalize-and-Project (a minimal sketch of the general idea appears after this list).
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
- Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise [16.12834917344859]
It is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large-batch settings.
We show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate.
This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning.
arXiv Detail & Related papers (2023-12-22T09:58:39Z)
- The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
- Distributional Gradient Matching for Learning Uncertain Neural Dynamics Models [38.17499046781131]
We propose a novel approach towards estimating uncertain neural ODEs, avoiding the numerical integration bottleneck.
Our algorithm - distributional gradient matching (DGM) - jointly trains a smoother and a dynamics model and matches their gradients via minimizing a Wasserstein loss.
Our experiments show that, compared to traditional approximate inference methods based on numerical integration, our approach is faster to train, faster at predicting previously unseen trajectories, and in the context of neural ODEs, significantly more accurate.
arXiv Detail & Related papers (2021-06-22T08:40:51Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Training Generative Adversarial Networks by Solving Ordinary Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
arXiv Detail & Related papers (2020-10-28T15:23:49Z)
- Momentum via Primal Averaging: Theoretical Insights and Learning Rate Schedules for Non-Convex Optimization [10.660480034605241]
Momentum methods are now used pervasively within the machine learning community for training non-convex models such as deep neural networks.
In this work we develop a Lyapunov analysis of SGD with momentum by utilizing an equivalent rewriting of the method known as the stochastic primal averaging (SPA) form.
arXiv Detail & Related papers (2020-10-01T13:46:32Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
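As noted in the Normalize-and-Project entry above, for normalized (scale-invariant) layers, growth in the parameter norm acts like a decay of the effective learning rate eta/||w||^2. The sketch below illustrates that general idea only; the projection radius, toy gradient, and function names are assumptions of this illustration, not the paper's actual Normalize-and-Project procedure.

```python
import numpy as np

# Hedged sketch of the norm-growth / effective-learning-rate equivalence.
# Without any intervention, SGD on a scale-invariant layer tends to grow
# ||w||, and since the loss depends only on w/||w||, the step taken on the
# direction shrinks like eta/||w||^2 -- an implicit learning-rate decay.
# Projecting w back to a fixed-norm sphere after each step removes that
# implicit decay, so the learning-rate schedule must be stated explicitly.

rng = np.random.default_rng(1)

def project_to_sphere(w, radius=1.0):
    # keep ||w|| fixed so the effective step size stays eta / radius^2
    return radius * w / np.linalg.norm(w)

def toy_scale_invariant_grad(w):
    # toy gradient orthogonal to w, scaled by 1/||w|| (illustrative only)
    g = rng.normal(size=w.shape)
    g -= w * (w @ g) / (w @ w)
    return g / np.linalg.norm(w)

eta = 0.1                                   # explicit, constant schedule
w = project_to_sphere(rng.normal(size=10))
for step in range(1_000):
    g = toy_scale_invariant_grad(w)
    w = project_to_sphere(w - eta * g)      # norm stays fixed after each step
```

Dropping the projection lets ||w|| grow, silently shrinking eta/||w||^2 over training; keeping it makes the schedule explicit, which is the general point of the reparameterization described in that entry.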