AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2505.21651v1
- Date: Tue, 27 May 2025 18:25:21 GMT
- Title: AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent
- Authors: Nikola Surjanovic, Alexandre Bouchard-Côté, Trevor Campbell
- Abstract summary: We introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
- Score: 58.05410015124021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The learning rate is an important tuning parameter for stochastic gradient descent (SGD) and can greatly influence its performance. However, appropriate selection of a learning rate schedule across all iterations typically requires a non-trivial amount of user tuning effort. To address this, we introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration and then takes appropriate action. We introduce theory supporting the convergence of AutoSGD, along with its deterministic counterpart for standard gradient descent. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
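The abstract does not spell out AutoSGD's increase/decrease rule, so the following is only a minimal sketch of the general shape of such a loop: a hypothetical comparison of recent averaged losses drives multiplicative learning-rate changes. The window/factor heuristics and function names are assumptions, not the authors' algorithm.

```python
import numpy as np

def auto_sgd_sketch(loss, grad, x0, lr=0.1, factor=1.5, window=25,
                    n_iters=500, rng=None):
    """Toy SGD loop that adapts the learning rate multiplicatively.

    The up/down decision (comparing the mean loss over the last two windows)
    is a hypothetical stand-in for AutoSGD's actual rule, which the abstract
    does not spell out.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    losses = []
    for t in range(n_iters):
        g = grad(x, rng)                       # stochastic gradient estimate
        x = x - lr * g                         # plain SGD step
        losses.append(loss(x))
        if (t + 1) % (2 * window) == 0:
            old = np.mean(losses[-2 * window:-window])
            new = np.mean(losses[-window:])
            # reward recent progress with a larger step, punish stalls
            lr = lr * factor if new < old else lr / factor
    return x, lr

# usage: noisy quadratic f(x) = 0.5 * ||x||^2
loss = lambda x: 0.5 * float(x @ x)
grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_final, lr_final = auto_sgd_sketch(loss, grad, x0=np.ones(10))
print(f"final loss {loss(x_final):.4f}, final learning rate {lr_final:.3f}")
```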
Related papers
- Online Learning-guided Learning Rate Adaptation via Gradient Alignment [25.688764889273237]
The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate. We propose a principled framework called GALA (Gradient Alignment-based Adaptation), which adjusts the learning rate by tracking the alignment between consecutive gradients and a local curvature estimate. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning-rate schedule.
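As a rough illustration of the gradient-alignment idea in the GALA entry above, the sketch below scales the learning rate by the inner product of consecutive gradients divided by a finite-difference curvature proxy. The curvature estimator, the tanh-bounded multiplicative update, and the omission of the Follow-the-Regularized-Leader pairing are all simplifying assumptions rather than GALA's exact rule.

```python
import numpy as np

def gala_style_lr(grad_fn, x0, lr=0.05, eta=0.1, n_iters=500, eps=1e-12):
    """Learning rate driven by the alignment of consecutive gradients.

    align_t compares the current gradient with the previous one, scaled by a
    secant (finite-difference) curvature proxy along the last step; positive
    alignment grows the rate, negative alignment shrinks it.
    """
    x = np.asarray(x0, dtype=float)
    g_prev, x_prev = None, None
    for _ in range(n_iters):
        g = grad_fn(x)
        if g_prev is not None:
            s = x - x_prev                                # last step taken
            curv = abs((g - g_prev) @ s) / (s @ s + eps)  # local curvature proxy
            align = (g @ g_prev) / (curv + eps)
            lr *= np.exp(eta * np.tanh(align))            # bounded rate change
        x_prev, g_prev = x.copy(), g.copy()
        x = x - lr * g
    return x, lr
```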
arXiv Detail & Related papers (2025-06-10T03:46:41Z)
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives. We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
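The summary above does not give NGDiff's exact combination rule, so the snippet below is only a guess at what a "normalized gradient difference" update could look like for an unlearning setting with a forget objective and a retain objective; NGDiff's adaptive learning rate component is omitted entirely.

```python
import numpy as np

def ngdiff_step_sketch(params, grad_forget, grad_retain, lr=0.01, eps=1e-12):
    """One parameter update combining a forget and a retain objective.

    Stand-in rule: normalize each task gradient and descend along their
    difference, which lowers the retain loss while pushing the forget loss up.
    """
    gf = grad_forget / (np.linalg.norm(grad_forget) + eps)  # unit forget-task gradient
    gr = grad_retain / (np.linalg.norm(grad_retain) + eps)  # unit retain-task gradient
    direction = gr - gf        # balanced trade-off between the two objectives
    return params - lr * direction

# usage with toy gradients
p = np.zeros(4)
p = ngdiff_step_sketch(p, grad_forget=np.array([1.0, 0, 0, 0]),
                       grad_retain=np.array([0, 1.0, 0, 0]))
```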
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
- Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates [3.6185342807265415]
Deep learning algorithms are the key ingredients in many artificial intelligence (AI) systems.
Deep learning algorithms typically consist of a class of deep neural networks trained by a stochastic gradient descent (SGD) optimization method.
arXiv Detail & Related papers (2024-07-11T00:10:35Z)
- Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization [0.6906005491572401]
We show that noise in mini-batch stochastic gradient descent (SGD) has the effect of smoothing the objective function. We show that there is an interesting relationship between the degree of smoothing induced by SGD's noise and the well-studied 'sharpness' indicator.
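A compact way to see the "noise acts like smoothing" viewpoint from this entry is explicit Gaussian randomized smoothing, where the gradient of the smoothed objective is estimated by averaging gradients at perturbed points. The sigma schedule and Monte Carlo estimator below are illustrative choices, not the paper's construction of the SGD-noise/smoothing correspondence.

```python
import numpy as np

def smoothed_grad(grad_fn, x, sigma, n_samples=64, rng=None):
    """Monte Carlo gradient of the Gaussian-smoothed objective
    f_sigma(x) = E_u[f(x + sigma * u)] with u ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((n_samples,) + x.shape)
    return np.mean([grad_fn(x + sigma * ui) for ui in u], axis=0)

def graduated_descent(grad_fn, x0, sigmas=(1.0, 0.3, 0.1, 0.0), lr=0.02,
                      steps_per_level=300):
    """Graduated optimization: descend heavily smoothed surrogates first,
    then progressively sharper ones, ending on the raw objective."""
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:
        for _ in range(steps_per_level):
            g = grad_fn(x) if sigma == 0.0 else smoothed_grad(grad_fn, x, sigma)
            x = x - lr * g
    return x

# usage: wiggly 1-D objective f(x) = x^2 + 2*sin(5x), gradient 2x + 10*cos(5x)
grad = lambda x: 2.0 * x + 10.0 * np.cos(5.0 * x)
print(graduated_descent(grad, x0=np.array([3.0])))
```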
arXiv Detail & Related papers (2023-11-15T07:27:40Z)
- Mechanic: A Learning Rate Tuner [52.4242550204696]
We introduce a technique for automatically tuning the learning rate scale factor of any base optimization algorithm and schedule, which we call Mechanic.
We rigorously evaluate Mechanic on a range of large-scale deep learning tasks with varying batch sizes, schedules, and base optimization algorithms.
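Mechanic is described here only as a tuner for the scale factor applied on top of any base optimizer and schedule. The sketch below shows that wrapper shape with a placeholder sign-agreement rule for the scale; Mechanic's actual online-learning update for the scale factor is not reproduced.

```python
import numpy as np

class ScaledOptimizerSketch:
    """Wraps a base optimizer update and rescales it by a learned scalar.

    The sign-agreement heuristic below is only a placeholder showing the
    wrapper shape, not Mechanic's update rule.
    """

    def __init__(self, base_update, scale=1e-3, growth=1.02, shrink=0.5):
        self.base_update = base_update     # maps (params, grad) -> raw update
        self.scale = scale                 # learned global scale factor
        self.growth, self.shrink = growth, shrink
        self._last_update = None

    def step(self, params, grad):
        update = self.base_update(params, grad)
        if self._last_update is not None:
            # placeholder: grow the scale while consecutive updates agree in
            # direction, shrink it sharply when they flip
            if float(update @ self._last_update) > 0:
                self.scale *= self.growth
            else:
                self.scale *= self.shrink
        self._last_update = update
        return params + self.scale * update

# usage: plain SGD as the base algorithm
opt = ScaledOptimizerSketch(base_update=lambda p, g: -g)
params = opt.step(np.ones(3), np.array([0.3, -0.1, 0.2]))
```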
arXiv Detail & Related papers (2023-05-31T19:32:43Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which one parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
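To make the AdaRem description above concrete, the sketch below keeps an exponential moving average of past updates and enlarges or damps each parameter's step depending on whether that average agrees with the current descent direction. The (1 + kappa * agreement) modulation and the EMA constants are assumptions rather than the paper's formula.

```python
import numpy as np

def adarem_style_step(params, grad, state, base_lr=0.01, beta=0.9, kappa=0.5):
    """Parameter-wise learning-rate modulation in the spirit of AdaRem.

    An exponential moving average of past updates records how each parameter
    has been moving; where that motion agrees with the current descent
    direction the step is enlarged, where it conflicts the step is damped.
    """
    ema = state.get("ema", np.zeros_like(params))
    descent = -grad
    agreement = np.sign(ema) * np.sign(descent)   # element-wise, in {-1, 0, 1}
    lr = base_lr * (1.0 + kappa * agreement)      # per-parameter learning rates
    update = lr * descent
    state["ema"] = beta * ema + (1.0 - beta) * update
    return params + update, state

# usage over a few toy steps
state, params = {}, np.ones(5)
for _ in range(3):
    grad = 2.0 * params                           # gradient of ||params||^2
    params, state = adarem_style_step(params, grad, state)
```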
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-regressive formulation.
It flexibly adjusts to abrupt changes of behaviours induced by new learning rate values.
It is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks, as well as warm-starting it for a new task.
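The paper models entire optimiser traces with latent, auto-regressive Gaussian processes; that is considerably richer than the sketch below, which only fits a scikit-learn GP to (log learning rate, final loss) pairs and proposes the next rate via a lower confidence bound. It is meant as a minimal stand-in for the same tuning goal, and scikit-learn is an assumed dependency not mentioned in the entry.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next_lr(log_lrs, losses, candidates=None, kappa=2.0):
    """Suggest the next learning rate from past (log10 lr, final loss) pairs.

    A GP surrogate is fit to the observations and the candidate with the
    lowest lower-confidence bound is returned; this is generic Bayesian
    optimisation, not the paper's trace model.
    """
    if candidates is None:
        candidates = np.linspace(-5.0, 0.0, 200)      # log10 lr grid over [1e-5, 1]
    X = np.asarray(log_lrs, dtype=float).reshape(-1, 1)
    y = np.asarray(losses, dtype=float)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)
    mean, std = gp.predict(candidates.reshape(-1, 1), return_std=True)
    best = candidates[np.argmin(mean - kappa * std)]  # optimistic minimiser
    return 10.0 ** best

# usage: three earlier runs at lr = 1e-4, 1e-3, 1e-2
print(propose_next_lr(log_lrs=[-4, -3, -2], losses=[0.9, 0.6, 0.8]))
```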
arXiv Detail & Related papers (2020-06-25T13:18:18Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Statistical Adaptive Stochastic Gradient Methods [34.859895010071234]
We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in stochastic gradient methods.
SALSA first uses a smoothed line-search procedure to gradually increase the learning rate, then automatically decreases the learning rate.
The method for decreasing the learning rate is based on a new statistical test for detecting stationarity when using a constant step size.
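As a rough companion to the SALSA description, the sketch below grows the learning rate while a least-squares slope test on the recent loss trace still shows significant improvement, and halves it once the trace looks stationary. The z-test on the fitted slope is a stand-in for the paper's smoothed line search and stationarity test, not the actual procedure.

```python
import numpy as np

def salsa_style_schedule(losses_so_far, lr, warmup=True, grow=1.1, drop=0.5,
                         window=100, z_crit=2.0):
    """Decide the next learning rate from the recent loss trace.

    Returns (new_lr, still_in_warmup).  The slope z-test below is a crude
    placeholder for the paper's stationarity test.
    """
    if len(losses_so_far) < window:
        return lr, warmup
    recent = np.asarray(losses_so_far[-window:], dtype=float)
    t = np.arange(window, dtype=float)
    # least-squares slope of the recent loss trace and its standard error
    slope, intercept = np.polyfit(t, recent, 1)
    resid = recent - (slope * t + intercept)
    se = np.sqrt(np.sum(resid**2) / (window - 2) / np.sum((t - t.mean())**2))
    improving = slope < -z_crit * se          # significantly decreasing loss?
    if warmup:
        # keep growing the rate while progress is clear, then end the warm-up
        return (lr * grow, True) if improving else (lr, False)
    # constant-rate phase: halve the rate once the trace looks stationary
    return (lr, False) if improving else (lr * drop, False)

# usage inside a training loop:
# lr, warmup = salsa_style_schedule(loss_history, lr, warmup)
```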
arXiv Detail & Related papers (2020-02-25T00:04:16Z)