Cutting Some Slack for SGD with Adaptive Polyak Stepsizes
- URL: http://arxiv.org/abs/2202.12328v1
- Date: Thu, 24 Feb 2022 19:31:03 GMT
- Title: Cutting Some Slack for SGD with Adaptive Polyak Stepsizes
- Authors: Robert M. Gower and Mathieu Blondel and Nidham Gazagnadou and Fabian Pedregosa
- Abstract summary: We consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods.
We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems.
We use this insight to develop new variants of the SPS method that are better suited to nonlinear models.
- Score: 35.024680868164445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tuning the step size of stochastic gradient descent is tedious and error
prone. This has motivated the development of methods that automatically adapt
the step size using readily available information. In this paper, we consider
the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive
methods. These are methods that make use of gradient and loss value at the
sampled points to adaptively adjust the step size. We first show that SPS and
its recent variants can all be seen as extensions of the Passive-Aggressive
methods applied to nonlinear problems. We use this insight to develop new
variants of the SPS method that are better suited to nonlinear models. Our new
variants are based on introducing a slack variable into the interpolation
equations. This single slack variable tracks the loss function across
iterations and is used in setting a stable step size. We provide extensive
numerical results supporting our new methods and a convergence theory.
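To make the core idea concrete, here is a minimal sketch of SGD with a capped stochastic Polyak stepsize on a toy least-squares problem. This is an illustration of the basic SPS rule, not the paper's slack-variable variants; the cap `gamma_max`, the constant `c`, and the assumption that each per-sample optimal loss is zero (interpolation) are illustrative choices.

```python
import numpy as np

# Toy interpolation problem: least squares with a consistent system,
# so every per-sample loss f_i can reach f_i^* = 0.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true

def loss_i(x, i):
    r = A[i] @ x - b[i]
    return 0.5 * r ** 2

def grad_i(x, i):
    r = A[i] @ x - b[i]
    return r * A[i]

def sps(x0, steps=500, gamma_max=10.0, c=0.5, eps=1e-12):
    """SGD where each step uses the stochastic Polyak stepsize
    (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2), capped at gamma_max."""
    x = x0.copy()
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(x, i)
        # f_i^* = 0 under interpolation, so the numerator is just the loss.
        gamma = min(loss_i(x, i) / (c * (g @ g) + eps), gamma_max)
        x -= gamma * g
    return x

x_hat = sps(np.zeros(d))
print(np.linalg.norm(x_hat - x_true))  # should be small under interpolation
```

Note that with `c = 0.5` on least squares this stepsize reduces to `1 / ||A[i]||**2`, i.e. each step projects exactly onto the sampled hyperplane (the Kaczmarz method), which is one reason Polyak stepsizes behave well in the interpolation regime.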
Related papers
- AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size [42.84471753630676]
We present a novel adaptation of Stochastic Gradient Descent (SGD) called AdaBatchGrad.
It seamlessly integrates an adaptive step size with an adjustable batch size.
We experimentally show how the introduction of an adaptive step size and an adaptive batch size gradually improves the performance of regular SGD.
arXiv Detail & Related papers (2024-02-07T21:19:05Z)
- Adaptive Learning Rates for Faster Stochastic Gradient Methods [6.935471115003109]
We propose new adaptive step size strategies that improve several stochastic gradient methods.
Our first method is based on the classical Polyak step size (Polyak, 1987) and is an extension of the recent development of this method for the stochastic setting.
Our second method, denoted GraDS, rescales the step size by the "diversity of gradients".
arXiv Detail & Related papers (2022-08-10T11:36:00Z)
- The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits an order-optimal convergence rate.
arXiv Detail & Related papers (2022-02-11T17:37:54Z)
- Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) [32.24244211281863]
We study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.
We prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
We show that these techniques improve the convergence and generalization of adaptive gradient methods across tasks.
arXiv Detail & Related papers (2020-06-11T21:23:30Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Explore Aggressively, Update Conservatively: Stochastic Extragradient Methods with Variable Stepsize Scaling [34.35013145885164]
Extragradient methods have become a staple for solving large-scale saddlepoint problems in machine learning.
We show in this paper that running vanilla extragradient with stochastic gradients may jeopardize its convergence, even in simple bilinear models.
We show that scaling the stepsize allows the method to converge even with stochastic gradients, and we derive sharp convergence rates under an error bound condition.
arXiv Detail & Related papers (2020-03-23T10:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.