Optimal learning rate schedules in high-dimensional non-convex
optimization problems
- URL: http://arxiv.org/abs/2202.04509v1
- Date: Wed, 9 Feb 2022 15:15:39 GMT
- Title: Optimal learning rate schedules in high-dimensional non-convex
optimization problems
- Authors: Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli
- Abstract summary: Learning rate schedules are ubiquitously used to speed up and improve optimisation.
We present a first analytical study of the role of learning rate scheduling in this setting.
- Score: 14.058580956992051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rate schedules are ubiquitously used to speed up and improve
optimisation. Many different policies have been introduced on an empirical
basis, and theoretical analyses have been developed for convex settings.
However, in many realistic problems the loss-landscape is high-dimensional and
non convex -- a case for which results are scarce. In this paper we present a
first analytical study of the role of learning rate scheduling in this setting,
focusing on Langevin optimization with a learning rate decaying as
$\eta(t)=t^{-\beta}$. We begin by considering models where the loss is a
Gaussian random function on the $N$-dimensional sphere ($N\rightarrow \infty$),
featuring an extensive number of critical points. We find that to speed up
optimization without getting stuck in saddles, one must choose a decay rate
$\beta<1$, contrary to convex setups where $\beta=1$ is generally optimal. We
then add to the problem a signal to be recovered. In this setting, the dynamics
decompose into two phases: an \emph{exploration} phase where the dynamics
navigates through rough parts of the landscape, followed by a
\emph{convergence} phase where the signal is detected and the dynamics enter a
convex basin. In this case, it is optimal to keep a large learning rate during
the exploration phase to escape the non-convex region as quickly as possible,
then use the convex criterion $\beta=1$ to converge rapidly to the solution.
Finally, we demonstrate that our conclusions hold in a common regression task
involving neural networks.
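As an illustration of the scheduling strategies described in the abstract, the sketch below implements Langevin dynamics on the $N$-sphere with either a power-law decay $\eta(t)=\eta_0 t^{-\beta}$ or the two-phase strategy (a constant learning rate during exploration, then the convex-optimal $1/t$ decay). This is not code from the paper: the toy spiked loss, the switch time `t_switch`, the temperature, and all parameter values are hypothetical stand-ins for the Gaussian random landscape and spiked model studied by the authors.

```python
import numpy as np

def lr_schedule(t, eta0=0.5, beta=0.5, t_switch=None):
    """Learning rate eta(t) = eta0 * t^(-beta).

    If t_switch is given, use the two-phase strategy from the abstract:
    a constant (large) learning rate during exploration, then the
    convex-optimal 1/t decay (beta = 1) after the switch.
    """
    if t_switch is not None:
        return eta0 if t < t_switch else eta0 * t_switch / t
    return eta0 * t ** (-beta)

def langevin_step_on_sphere(x, grad, eta, temperature, rng):
    """One Langevin update, projected back onto the sphere |x| = sqrt(N)."""
    N = x.size
    x = x - eta * grad + np.sqrt(2.0 * eta * temperature) * rng.standard_normal(N)
    return x * np.sqrt(N) / np.linalg.norm(x)

# Toy spiked loss L(x) = -(v.x)^2 / N as a stand-in for the random landscape.
rng = np.random.default_rng(0)
N = 1000
v = rng.standard_normal(N); v *= np.sqrt(N) / np.linalg.norm(v)   # planted signal
x = rng.standard_normal(N); x *= np.sqrt(N) / np.linalg.norm(x)   # random start

for t in range(1, 5001):
    grad = -2.0 * (v @ x) * v / N                      # gradient of the toy loss
    eta = lr_schedule(t, eta0=0.5, t_switch=1000)      # large eta, then 1/t decay
    x = langevin_step_on_sphere(x, grad, eta, temperature=0.01, rng=rng)

print("overlap with the signal:", abs(v @ x) / N)      # approaches 1 when recovered
```

In the paper's analysis, the exploration phase more generally calls for any decay rate $\beta<1$; the constant-then-$1/t$ switch above is just one simple instantiation of that advice.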
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes (a minimal sketch of these stepsize rules is given after this list).
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity [59.75300530380427]
We consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm has access only to noisy evaluations of the objective function it queries.
We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds.
arXiv Detail & Related papers (2024-06-28T02:56:22Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experimental studies show that training with clipping exhibits a fairly special behavior that mitigates the order of iteration complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Blessing of Nonconvexity in Deep Linear Models: Depth Flattens the
Optimization Landscape Around the True Solution [4.7464518249313805]
This work characterizes the effect of depth on the optimization landscape of linear regression.
We show that, despite their non-convexity, deeper models have a more desirable optimization landscape around the true solution.
arXiv Detail & Related papers (2022-07-15T17:11:26Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance-Reduced Gradient Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic
Objectives with Skewed Hessian Spectrums [26.44093918424658]
Eigencurve is the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives.
Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks.
Two simple learning rate schedulers for practical applications can approximate Eigencurve.
arXiv Detail & Related papers (2021-10-27T01:17:53Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability and with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale non-convex optimization problems are ubiquitous in modern machine learning.
We perform experiments on the effects of a wide array of synthetic minibatch sizes on Stochastic Gradient Descent (SGD).
arXiv Detail & Related papers (2020-02-09T09:56:06Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-convex non-concave min-max problems.
Our experiments show that adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training, consistent with what is observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
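The clipping and Polyak-stepsize methods mentioned in the first related paper above reduce to two one-line stepsize rules. The Python sketch below is a generic illustration under the standard definitions of hard gradient clipping and the Polyak stepsize; it is an assumption of this summary, not that paper's exact (smoothed) variant, and the toy objective and constants are hypothetical.

```python
import numpy as np

def clipped_gd_step(x, grad, eta, clip_threshold):
    """Gradient descent with (hard) gradient clipping:
    effective stepsize = min(eta, clip_threshold / ||grad||)."""
    step = min(eta, clip_threshold / (np.linalg.norm(grad) + 1e-12))
    return x - step * grad

def polyak_gd_step(x, grad, f_x, f_star):
    """Gradient descent with the Polyak stepsize
    eta_t = (f(x_t) - f*) / ||grad||^2, which needs (an estimate of) f*."""
    return x - (f_x - f_star) / (grad @ grad + 1e-12) * grad

# Toy usage on f(x) = 0.5 * ||x||^2, whose optimal value is f* = 0.
x = np.array([10.0, -4.0])
for _ in range(30):
    x = polyak_gd_step(x, grad=x, f_x=0.5 * (x @ x), f_star=0.0)
print(x)   # close to the minimizer at the origin
```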
This list is automatically generated from the titles and abstracts of the papers in this site.