Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic
Objectives with Skewed Hessian Spectrums
- URL: http://arxiv.org/abs/2110.14109v2
- Date: Thu, 28 Oct 2021 00:59:30 GMT
- Title: Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic
Objectives with Skewed Hessian Spectrums
- Authors: Rui Pan, Haishan Ye, Tong Zhang
- Abstract summary: Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
The theory also inspires two simple learning rate schedulers for practical applications that approximate Eigencurve.
- Score: 26.44093918424658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning rate schedulers have been widely adopted in training deep neural
networks. Despite their practical importance, there is a discrepancy between
their use in practice and their theoretical analysis. For instance, it is not known
which schedule of SGD achieves the best convergence, even for simple problems such as
optimizing quadratic objectives. So far, step decay has been one of the
strongest candidates under this setup, having been proved nearly optimal
with an $\mathcal{O}(\log T)$ gap. However, according to our analysis, this gap
turns out to be $\Omega(\log T)$ in a wide range of settings, which turns the
schedule optimality problem back into an open question. Towards answering this
reopened question, in this paper, we propose Eigencurve, the first family of
learning rate schedules that can achieve minimax optimal convergence rates (up
to a constant) for SGD on quadratic objectives when the eigenvalue distribution
of the underlying Hessian matrix is skewed. The condition is quite common in
practice. Experimental results show that Eigencurve can significantly
outperform step decay in image classification tasks on CIFAR-10, especially
when the number of epochs is small. Moreover, the theory inspires two simple
learning rate schedulers for practical applications that can approximate
Eigencurve. For some problems, the optimal shape of the proposed schedulers
resembles that of cosine decay, which sheds light on the success of cosine
decay for such situations. For other situations, the proposed schedulers are
superior to cosine decay.
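The abstract does not give the closed form of the Eigencurve schedule itself, but the two baselines it is compared against, step decay and cosine decay, have standard shapes. Below is a minimal sketch of those baselines in plain Python; the function names, the stage count `num_stages`, and the decay factor are illustrative choices, not values from the paper.

```python
import math

def step_decay(eta0, t, total_steps, num_stages=5, factor=0.1):
    """Piecewise-constant step decay: multiply the learning rate by
    `factor` after each of `num_stages` equal-length stages."""
    stage = min(int(num_stages * t / total_steps), num_stages - 1)
    return eta0 * factor ** stage

def cosine_decay(eta0, t, total_steps, eta_min=0.0):
    """Cosine annealing: smoothly decay from eta0 at t=0 to eta_min at t=total_steps."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / total_steps))

if __name__ == "__main__":
    T = 30_000  # total number of SGD iterations (illustrative)
    for t in (0, T // 4, T // 2, 3 * T // 4, T - 1):
        print(f"t={t:>6}  step={step_decay(0.1, t, T):.5f}  cosine={cosine_decay(0.1, t, T):.5f}")
```

Printing a few points of each curve makes the qualitative difference visible: step decay holds the rate constant within a stage and drops it abruptly, while cosine decay anneals it smoothly, which is the shape the paper reports Eigencurve resembling for some problems.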
Related papers
- Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
- Delayed Geometric Discounts: An Alternative Criterion for Reinforcement Learning [1.52292571922932]
Reinforcement learning (RL) provides a theoretical framework for learning optimal behaviors.
In practice, RL algorithms rely on geometric discounts to evaluate this optimality.
In this paper, we tackle these issues by generalizing the discounted problem formulation with a family of delayed objective functions.
arXiv Detail & Related papers (2022-09-26T07:49:38Z)
- Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we analyze two variants of this kind, namely Variance Reduced Langevin Dynamics and Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
- Optimal learning rate schedules in high-dimensional non-convex optimization problems [14.058580956992051]
Learning rate schedules are ubiquitously used to speed up and improve optimisation.
We present a first analytical study of the role of learning rate scheduling in this setting.
arXiv Detail & Related papers (2022-02-09T15:15:39Z)
- PDE-Based Optimal Strategy for Unconstrained Online Learning [40.61498562988079]
We present a framework that generates time-varying potential functions by solving a Partial Differential Equation (PDE).
Our framework recovers some classical potentials, and more importantly provides a systematic approach to design new ones.
This is the first parameter-free algorithm with an optimal leading constant.
arXiv Detail & Related papers (2022-01-19T22:21:21Z)
- STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic [142.1492359556374]
Bilevel optimization is a class of problems which exhibit a two-level structure.
We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem.
We show that a two-timescale natural actor-critic policy optimization algorithm can be viewed as a special case of our TTSA framework.
arXiv Detail & Related papers (2020-07-10T05:20:02Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)