Good regularity creates large learning rate implicit biases: edge of
stability, balancing, and catapult
- URL: http://arxiv.org/abs/2310.17087v2
- Date: Tue, 12 Dec 2023 04:06:02 GMT
- Title: Good regularity creates large learning rate implicit biases: edge of
stability, balancing, and catapult
- Authors: Yuqing Wang, Zhenghao Xu, Tuo Zhao, Molei Tao
- Abstract summary: Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability, balancing, and catapult.
This paper provides an initial step toward identifying for which objective functions these biases are likely, and shows that they are in fact various tips of the same iceberg.
- Score: 49.8719617899285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large learning rates, when applied to gradient descent for nonconvex
optimization, yield various implicit biases including the edge of stability
(Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et
al., 2020). These phenomena cannot be well explained by classical optimization
theory. Though significant theoretical progress has been made in understanding
these implicit biases, it remains unclear for which objective functions they are
more likely to occur. This paper provides an initial step in answering this
question and also shows that these implicit biases are in fact various tips of
the same iceberg. To establish these results, we develop a global convergence
theory under large learning rates, for a family of nonconvex functions without
globally Lipschitz continuous gradient, a property typically assumed in existing
convergence analyses. Specifically, these phenomena are more likely to occur
when the optimization objective function has good regularity. This regularity,
together with gradient descent using a large learning rate that favors flatter
regions, results in these nontrivial dynamical behaviors. Another corollary is
the first non-asymptotic convergence rate bound for large-learning-rate
gradient descent optimization of nonconvex functions. Although our theory only
applies to specific functions so far, the possibility of extrapolating it to
neural networks is also experimentally validated, for which different choices
of loss, activation functions, and other techniques such as batch normalization
can all affect regularity significantly and lead to very different training
dynamics.
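To make the large-learning-rate dynamics described above concrete, below is a minimal, illustrative sketch (not the paper's construction): plain gradient descent on a simple one-dimensional nonconvex objective whose gradient is not globally Lipschitz. The quartic f(x) = (x^2 - 1)^2 / 8 and the step sizes are assumptions chosen for illustration; above the classical stability threshold 2/f''(x*) the iterates overshoot and oscillate non-monotonically around a minimizer, qualitatively echoing the edge-of-stability/catapult behavior, while a small learning rate converges classically.

```python
# Minimal sketch: gradient descent at a large vs. small learning rate on a
# simple nonconvex 1-D objective. Function and step sizes are illustrative
# assumptions, not the paper's setting.

def f(x):
    return (x ** 2 - 1) ** 2 / 8.0   # nonconvex; gradient is not globally Lipschitz

def grad_f(x):
    return x * (x ** 2 - 1) / 2.0    # f'(x)

def hess_f(x):
    return (3 * x ** 2 - 1) / 2.0    # local sharpness f''(x); equals 1 at x = +-1

def run_gd(x0, lr, steps=30):
    """Plain gradient descent; with lr > 2 / f''(x*) the iterates do not
    converge monotonically and instead oscillate around the minimizer."""
    x = x0
    for t in range(steps):
        print(f"lr={lr:.2f}  t={t:2d}  x={x:+.4f}  loss={f(x):.5f}  sharpness={hess_f(x):+.3f}")
        x = x - lr * grad_f(x)
    return x

if __name__ == "__main__":
    # Sharpness at the minima x = +-1 is f''(+-1) = 1, so the classical
    # stability threshold is lr = 2.
    run_gd(x0=0.4, lr=2.5)   # large learning rate: non-monotone, oscillatory dynamics
    run_gd(x0=0.4, lr=0.5)   # small learning rate: classical monotone convergence
```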
Related papers
- Understanding Stochastic Natural Gradient Variational Inference [12.800664845601197]
We show a global convergence rate of $\mathcal{O}(\frac{1}{T})$, implying a non-asymptotic convergence rate for NGVI.
The rate is no worse than stochastic gradient descent (aka black-box variational inference) and has better constant dependency.
arXiv Detail & Related papers (2024-06-04T00:45:37Z) - On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a large fixed step size is used.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Gradient is All You Need? [0.0]
In this paper we provide a novel analytical perspective on the theoretical understanding of gradient-based learning algorithms by interpreting consensus-based optimization (CBO) as a stochastic relaxation of gradient-based optimization.
Our results prove the intrinsic power of CBO to alleviate the complexities of the nonlocal landscape function.
arXiv Detail & Related papers (2023-06-16T11:30:55Z) - High-Probability Bounds for Stochastic Optimization and Variational
Inequalities: the Case of Unbounded Variance [59.211456992422136]
We propose algorithms with high-probability convergence results under less restrictive assumptions.
These results justify the usage of the considered methods for solving problems that do not fit standard functional classes in optimization.
arXiv Detail & Related papers (2023-02-02T10:37:23Z) - Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
arXiv Detail & Related papers (2021-06-17T14:16:04Z) - Linear Last-iterate Convergence in Constrained Saddle-point Optimization [48.44657553192801]
We significantly expand the understanding of last-iterate convergence for Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU).
We show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant.
We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption.
arXiv Detail & Related papers (2020-06-16T20:53:04Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized e.g. by the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z)