Acceleration via Fractal Learning Rate Schedules
- URL: http://arxiv.org/abs/2103.01338v1
- Date: Mon, 1 Mar 2021 22:52:13 GMT
- Title: Acceleration via Fractal Learning Rate Schedules
- Authors: Naman Agarwal, Surbhi Goel, Cyril Zhang
- Abstract summary: The learning rate schedule remains notoriously difficult to understand and expensive to tune, even when the objective is a convex quadratic.
We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent.
We provide some experiments and discussion to challenge current understandings of the "edge of stability" in deep learning.
- Score: 37.878672787331105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When balancing the practical tradeoffs of iterative methods for large-scale
optimization, the learning rate schedule remains notoriously difficult to
understand and expensive to tune. We demonstrate the presence of these
subtleties even in the innocuous case when the objective is a convex quadratic.
We reinterpret an iterative algorithm from the numerical analysis literature as
what we call the Chebyshev learning rate schedule for accelerating vanilla
gradient descent, and show that the problem of mitigating instability leads to
a fractal ordering of step sizes. We provide some experiments and discussion to
challenge current understandings of the "edge of stability" in deep learning:
even in simple settings, provable acceleration can be obtained by making
negative local progress on the objective.
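As a concrete illustration of the abstract above, here is a minimal NumPy sketch based only on the abstract, not the paper's exact construction: the Chebyshev step sizes are taken as the reciprocals of the degree-n Chebyshev roots mapped onto an assumed spectral range [mu, L] of the quadratic's Hessian, and they are applied in a Lebedev-Finogenov-style bit-reversal interleaving, one plausible form of the fractal ordering.
```python
import numpy as np

def chebyshev_steps(mu, L, n):
    """Step sizes 1/lambda_k, where lambda_k are the degree-n Chebyshev roots
    mapped from (-1, 1) onto the assumed spectral interval [mu, L]."""
    k = np.arange(n)
    roots = np.cos((2 * k + 1) * np.pi / (2 * n))   # roots of the Chebyshev polynomial T_n
    return 1.0 / ((L + mu) / 2 + (L - mu) / 2 * roots)

def fractal_order(n):
    """Lebedev-Finogenov-style interleaving of indices 0..n-1 (n a power of two):
    each index of the half-size ordering is followed by its mirror, so large and
    small steps alternate and intermediate iterates stay bounded."""
    if n == 1:
        return [0]
    return [j for i in fractal_order(n // 2) for j in (i, n - 1 - i)]

# Toy usage: gradient descent on the convex quadratic 0.5 * x^T H x.
mu, L, n = 0.1, 10.0, 16                            # hypothetical spectral bounds
H = np.diag(np.linspace(mu, L, n))
x = np.random.randn(n)
steps = chebyshev_steps(mu, L, n)
for t in fractal_order(n):
    x = x - steps[t] * (H @ x)                      # gradient step with the t-th Chebyshev step size
print(np.linalg.norm(x))                            # residual shrinks by the accelerated Chebyshev factor
```
In exact arithmetic the final error does not depend on the ordering; the interleaving only controls how large the intermediate iterates get, which is the instability the abstract refers to.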
Related papers
- Stepping on the Edge: Curvature Aware Learning Rate Tuners [24.95412499942206]
The curvature information in question is the largest eigenvalue of the loss Hessian, known as the sharpness.
Recent work has shown that curvature information undergoes complex dynamics during training.
We analyze the closed-loop feedback effect between learning rate tuning and curvature.
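For reference, the sharpness mentioned above is just the top Hessian eigenvalue; a minimal sketch of estimating it by power iteration on finite-difference Hessian-vector products (a generic recipe, not the tuner from the paper) looks like this:
```python
import numpy as np

def sharpness(grad, w, iters=50, eps=1e-4):
    """Estimate the top Hessian eigenvalue (the 'sharpness') at w via power
    iteration on finite-difference Hessian-vector products Hv ~ (g(w+eps*v)-g(w))/eps."""
    g0 = grad(w)
    v = np.random.randn(w.size)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = (grad(w + eps * v) - g0) / eps              # approximate Hessian-vector product
        v = hv / np.linalg.norm(hv)
    return float(v @ ((grad(w + eps * v) - g0) / eps))   # Rayleigh quotient at the converged direction

# Sanity check on a quadratic 0.5 * w^T A w, whose Hessian is A.
A = np.diag([1.0, 3.0, 7.0])
print(sharpness(lambda w: A @ w, np.zeros(3)))           # ~7.0
```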
arXiv Detail & Related papers (2024-07-08T17:56:00Z) - On a continuous time model of gradient descent dynamics and instability in deep learning [12.20253214080485]
We propose the principal flow (PF) as a continuous time flow that approximates gradient descent dynamics.
The PF sheds light on the recently observed edge of stability phenomena in deep learning.
Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
arXiv Detail & Related papers (2023-02-03T19:03:10Z) - Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefit of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
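To make that interplay concrete in a toy finite-dimensional stand-in (not the paper's Hilbert-space analysis): for gradient descent on a least-squares objective started at zero, t steps with learning rate eta scale the i-th spectral component of the solution by 1 - (1 - eta * s_i^2)^t, so the learning rate and the early-stopping time jointly act as a spectral filter.
```python
import numpy as np

# GD on 0.5 * ||X w - y||^2 started from w = 0: after t steps with learning rate eta,
# the least-squares solution's component along the i-th singular direction of X is
# recovered up to the factor 1 - (1 - eta * s_i^2)^t (stable while eta * s_i^2 < 2).
def spectral_filter(s_squared, eta, t):
    return 1.0 - (1.0 - eta * s_squared) ** t

s2 = np.array([10.0, 1.0, 0.1])                # toy spectrum of X^T X
print(spectral_filter(s2, eta=0.05, t=20))     # ~[1.00, 0.64, 0.10]: small directions barely recovered
print(spectral_filter(s2, eta=0.099, t=20))    # ~[1.00, 0.88, 0.18]: a larger eta recovers them more
```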
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - Stochastic Optimization under Distributional Drift [3.0229888038442922]
We provide non-asymptotic convergence guarantees for algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability.
We identify a low drift-to-noise regime in which the tracking efficiency of the gradient method benefits significantly from a step decay schedule.
arXiv Detail & Related papers (2021-08-16T21:57:39Z) - Robust learning with anytime-guaranteed feedback [6.903929927172917]
Gradient-based learning algorithms are driven by queried feedback with almost no performance guarantees.
Here we explore a modified "anytime online-to-batch" mechanism which admits high-probability error bounds.
In practice, we show noteworthy gains on real-world data applications.
arXiv Detail & Related papers (2021-05-24T07:31:52Z) - Improved Analysis of Clipping Algorithms for Non-convex Optimization [19.507750439784605]
Recently, Zhang et al. (2019) show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
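For context, "clipped GD" here refers to the usual clip-by-global-norm update; a minimal sketch (generic, not the paper's exact algorithm or constants):
```python
import numpy as np

def clipped_gd_step(w, grad, lr, clip_norm):
    """One gradient descent step with the gradient clipped to a maximum norm,
    i.e. an effective step of lr * g * min(1, clip_norm / ||g||)."""
    g = grad(w)
    g_norm = np.linalg.norm(g)
    if g_norm > clip_norm:
        g = g * (clip_norm / g_norm)
    return w - lr * g

# Toy usage on a steep quadratic where unclipped GD with the same lr diverges
# (lr * curvature = 5 > 2); the clipped iterates stay bounded and end near 0.
grad = lambda w: 50.0 * w
w = np.array([10.0])
for _ in range(100):
    w = clipped_gd_step(w, grad, lr=0.1, clip_norm=1.0)
print(w)
```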
arXiv Detail & Related papers (2020-10-05T14:36:59Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - On Learning Rates and Schr\"odinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z) - Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction.
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)
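To fix ideas on the grafting experiment in the last entry: the update's norm is taken from one optimizer and its direction from another. A minimal sketch with hypothetical step functions (the paper's exact, possibly per-layer, protocol may differ):
```python
import numpy as np

def grafted_step(magnitude_alg, direction_alg, w, grad):
    """Combine two optimizers' proposed updates: keep the norm of one update and
    the direction of the other (grafting the magnitude onto the direction)."""
    m = magnitude_alg(w, grad)                 # update whose norm we keep
    d = direction_alg(w, grad)                 # update whose direction we keep
    return w + np.linalg.norm(m) * d / (np.linalg.norm(d) + 1e-12)

# Toy usage: SGD's step size grafted onto a sign-based direction (both hypothetical stand-ins).
sgd_step = lambda w, grad: -0.1 * grad(w)
sign_step = lambda w, grad: -np.sign(grad(w))
grad = lambda w: 2.0 * w                       # gradient of ||w||^2
w = np.array([3.0, -1.0])
for _ in range(10):
    w = grafted_step(sgd_step, sign_step, w, grad)
print(w)
```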
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.