Super-Convergence with an Unstable Learning Rate
- URL: http://arxiv.org/abs/2102.10734v1
- Date: Mon, 22 Feb 2021 02:05:47 GMT
- Title: Super-Convergence with an Unstable Learning Rate
- Authors: Samet Oymak
- Abstract summary: Conventional wisdom dictates that the learning rate should be in the stable regime so that gradient-based algorithms don't blow up.
This note introduces a simple scenario where an unstable learning rate scheme leads to super fast convergence.
Our scheme uses a Cyclical Learning Rate where we periodically take one large unstable step and several small stable steps to compensate for the instability.
- Score: 20.13887962999915
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Conventional wisdom dictates that the learning rate should be in
the stable regime so that gradient-based algorithms don't blow up. This note
introduces a simple scenario where an unstable learning rate scheme leads to
super fast convergence, with the convergence rate depending only
logarithmically on the condition number of the problem. Our scheme uses a
Cyclical Learning Rate (CLR), where we periodically take one large unstable
step and several small stable steps to compensate for the instability. These
findings also help explain the empirical
observations of [Smith and Topin, 2019] where they claim CLR with a large
maximum learning rate leads to "super-convergence". We prove that our scheme
excels in the problems where Hessian exhibits a bimodal spectrum and the
eigenvalues can be grouped into two clusters (small and large). The unstable
step is the key to enabling fast convergence over the small eigen-spectrum.
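The cyclic scheme described in the abstract can be sketched on a toy quadratic. In this illustration (not taken from the paper) the loss is $f(x) = \frac{1}{2} x^\top H x$ with a diagonal, bimodal Hessian; the eigenvalues, cycle length, and step sizes are assumed values chosen to make the effect visible:

```python
import numpy as np

# Illustrative sketch (not the paper's setup): gradient descent on a quadratic
# f(x) = 0.5 * x^T H x whose Hessian has a bimodal spectrum. The eigenvalues,
# cycle length, and step sizes below are assumed values.
large = np.array([1.0, 0.9, 0.8, 0.7, 0.5])              # large cluster
small = np.array([2e-3, 1.8e-3, 1.5e-3, 1.2e-3, 1e-3])   # small cluster
eigs = np.concatenate([large, small])

eta_stable = 1.0 / large.max()     # classical stable step, 1/L
eta_unstable = 1.0 / small.mean()  # huge step tuned to the small cluster
cycle = 10                         # one unstable step, nine stable steps

x = np.ones(eigs.size)             # iterate, expressed in the eigenbasis
for t in range(200):
    eta = eta_unstable if t % cycle == 0 else eta_stable
    x = x - eta * eigs * x         # gradient of the quadratic is H x

y = np.ones(eigs.size)             # baseline: plain GD, stable step only
for t in range(200):
    y = y - eta_stable * eigs * y

print(np.abs(x).max())  # cyclical scheme: every coordinate is tiny
print(np.abs(y).max())  # stable-only GD: small cluster has barely moved
```

The unstable step blows up the large-eigenvalue coordinates within each cycle, but the stable steps contract them again; meanwhile it is the only step that makes real progress on the small cluster, which stable-only GD leaves almost untouched after 200 iterations.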
Related papers
- Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation [51.248784084461334]
We prove new convergence rates for a generalized version of Nesterov acceleration under $\rho$-strong growth conditions.
Our analysis reduces the dependence on the strong growth constant from $\rho$ to $\sqrt{\rho}$ as compared to prior work.
arXiv Detail & Related papers (2024-04-03T00:41:19Z)
- Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z)
- The equivalence of dynamic and strategic stability under regularized learning in games [33.74394172275373]
We examine the long-run behavior of regularized, no-regret learning in finite games.
We obtain an equivalence between strategic and dynamic stability.
We show that methods based on entropic regularization converge at a geometric rate.
arXiv Detail & Related papers (2023-11-04T14:07:33Z)
- A Stability Principle for Learning under Non-Stationarity [1.1510009152620668]
We develop a versatile framework for statistical learning in non-stationary environments.
At the heart of our analysis lie two novel components: a measure of similarity between functions and a segmentation technique for dividing the non-stationary data sequence into quasi-stationary pieces.
arXiv Detail & Related papers (2023-10-27T17:53:53Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- Quantifying non-stabilizerness via information scrambling [0.6993026261767287]
A method to quantify quantum resources is to use a class of functions called magic monotones and stabilizer entropies.
We numerically show the relation of sampled correlators to different non-stabilizerness measures for both qubit and qutrit systems.
We put forward and simulate a protocol to measure the monotonic behaviour of magic for the time evolution of local Hamiltonians.
arXiv Detail & Related papers (2022-04-24T10:12:47Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
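The without-replacement sampling that these shuffling-based variants rely on can be illustrated with a minimal Random Reshuffling loop; the least-squares objective, step size, and epoch count below are assumed for illustration and are not taken from the paper:

```python
import numpy as np

# Minimal sketch of Random Reshuffling: each epoch permutes the data once
# and visits every sample exactly once (no replacement). The least-squares
# objective, step size, and epoch count are illustrative choices.
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 3))
w_star = rng.standard_normal(3)
b = a @ w_star                             # interpolation: zero noise

w = np.zeros(3)
for epoch in range(100):
    for i in rng.permutation(50):          # indices drawn without replacement
        g = (a[i] @ w - b[i]) * a[i]       # per-sample least-squares gradient
        w -= 0.05 * g

print(np.linalg.norm(w - w_star))          # distance to the true solution
```

In this interpolation setting the per-sample gradients vanish at the solution, so the constant-step loop converges to $w^\star$; replacing `rng.permutation(50)` with `rng.integers(0, 50, 50)` gives the with-replacement counterpart the summary compares against.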
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Stability and Generalization of Stochastic Gradient Methods for Minimax Problems [71.60601421935844]
Many machine learning problems can be formulated as minimax problems, such as Generative Adversarial Networks (GANs).
We provide a comprehensive generalization analysis of stochastic gradient methods for minimax problems.
arXiv Detail & Related papers (2021-05-08T22:38:00Z)
- Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness [23.22461721824713]
Gradient clipping is a technique to stabilize the training process for problems prone to the exploding gradient problem.
This paper establishes both qualitative and quantitative results for the clipped stochastic (sub)gradient method (SGD) on non-smooth convex functions.
We also study the convergence of a clipped method with momentum, which includes clipped SGD as a special case.
arXiv Detail & Related papers (2021-02-12T12:41:42Z)
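The clipping operation discussed in the entry above can be sketched as a norm threshold applied to each stochastic subgradient; the toy non-smooth objective $f(w) = \sum_i |a_i^\top w - b_i|$, the threshold, and the step size are assumed choices for illustration, not the paper's setup:

```python
import numpy as np

# Minimal sketch of norm-based gradient clipping in SGD, on an assumed toy
# non-smooth convex objective f(w) = sum_i |a_i . w - b_i|. The threshold
# and step size are illustrative, not taken from the paper.
def clip(g, threshold):
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

rng = np.random.default_rng(1)
a = rng.standard_normal((100, 5))
w_star = rng.standard_normal(5)
b = a @ w_star

w = np.zeros(5)
for t in range(2000):
    i = rng.integers(100)
    g = np.sign(a[i] @ w - b[i]) * a[i]   # subgradient of |a_i . w - b_i|
    w -= 0.01 * clip(g, threshold=1.0)    # clipped SGD step

print(np.linalg.norm(w - w_star))         # distance to the true solution
```

The clipping step bounds every update's norm by the threshold, which is what stabilizes training when individual subgradients can be arbitrarily large; the iterate still makes steady progress toward $w^\star$ because clipping rescales the direction without reversing it.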
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.