Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
- URL: http://arxiv.org/abs/2508.05408v1
- Date: Thu, 07 Aug 2025 13:59:47 GMT
- Title: Cumulative Learning Rate Adaptation: Revisiting Path-Based Schedules for SGD and Adam
- Authors: Asma Atamna, Tom Maus, Fabian Kievelitz, Tobias Glasmachers
- Abstract summary: Adaptive learning rate mechanisms adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. Our results aim to clarify when and why such adaptive strategies offer practical benefits.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer's internal preconditioning. We propose a corrected variant that better reflects Adam's update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.
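As a concrete illustration of the mechanism described in the abstract, the sketch below keeps a discounted "path" of normalized gradient steps and multiplicatively adjusts the learning rate when that path is longer or shorter than a random walk would make it. The smoothing constant `c`, the damping `d`, and the target path norm of 1 are assumptions made for this sketch, not the authors' exact formulation.
```python
import numpy as np

def cumulative_path_lr_step(grad, path, lr, c=0.1, d=2.0, eps=1e-12):
    """One update of a cumulative path-length learning-rate rule (illustrative only)."""
    g_unit = grad / (np.linalg.norm(grad) + eps)                # normalized gradient step
    path = (1.0 - c) * path + np.sqrt(c * (2.0 - c)) * g_unit   # time-discounted path
    # If successive steps were uncorrelated (a random walk), the stationary expected
    # squared norm of `path` would be 1: a longer path signals consistent progress
    # (raise the learning rate), a shorter one signals oscillation (lower it).
    lr = lr * np.exp((c / d) * (np.linalg.norm(path) - 1.0))
    return lr, path

# Possible usage inside a plain SGD loop (illustrative):
# path, lr = np.zeros_like(w), 0.1
# for x, y in data:
#     g = grad_fn(w, x, y)
#     lr, path = cumulative_path_lr_step(g, path, lr)
#     w -= lr * g
```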
Related papers
- Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models [88.47454470043552]
We consider the problem of online fine-tuning of the parameters of a language model at test time, also known as dynamic evaluation.
Online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights.
arXiv Detail & Related papers (2024-03-03T14:03:48Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient (a rough sketch of this rule follows the entry).
Our method outperforms previous adaptive learning-rate-based algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
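As a rough illustration of the rule described in this entry (not the published AdaRem algorithm; the EMA form and the scaling factor are assumptions), one can keep a per-parameter moving average of past movement directions and enlarge or shrink each parameter's learning rate depending on its alignment with the current gradient:
```python
import numpy as np

def direction_aligned_lrs(grad, past_dir, base_lr=0.01, beta=0.9, scale=0.5):
    """Per-parameter learning rates from the alignment between an EMA of past
    movement directions and the current gradient (illustrative sketch)."""
    move_dir = np.sign(-grad)                      # direction SGD would move each parameter now
    align = past_dir * move_dir                    # roughly in [-1, 1]; > 0 means consistent motion
    lrs = base_lr * (1.0 + scale * align)          # larger steps where motion has been consistent
    past_dir = beta * past_dir + (1 - beta) * move_dir
    return lrs, past_dir
```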
- Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum [97.84312669132716]
We disentangle the effects of the adaptive learning rate and momentum in the Adam dynamics on saddle-point escape and flat-minima selection.
Our experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
arXiv Detail & Related papers (2020-06-29T05:21:02Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors (a toy sketch of the per-coordinate weight selection follows this entry).
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
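A toy version of the weight selection sketched in this entry could pick, for every coordinate, the mixing weight from a small candidate set that maximizes the resulting variance estimate; the candidate set and the EMA form are assumptions for illustration, not the MaxVA algorithm itself.
```python
import numpy as np

def max_variance_moments(m, s, g, candidates=(0.5, 0.9, 0.99, 0.999)):
    """Update moment estimates m, s with, per coordinate, the mixing weight that
    maximizes the variance estimate s_new - m_new**2 (illustrative sketch)."""
    best_var = np.full_like(g, -np.inf)
    best_m, best_s = m.copy(), s.copy()
    for beta in candidates:
        m_new = beta * m + (1 - beta) * g
        s_new = beta * s + (1 - beta) * g ** 2
        var = s_new - m_new ** 2
        mask = var > best_var
        best_var = np.where(mask, var, best_var)
        best_m = np.where(mask, m_new, best_m)
        best_s = np.where(mask, s_new, best_s)
    return best_m, best_s  # s could then replace Adam's running mean of squared gradients
```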
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Disentangling Adaptive Gradient Methods from Learning Rates [65.0397050979662]
We take a deeper look at how adaptive gradient methods interact with the learning rate schedule.
We introduce a "grafting" experiment which decouples an update's magnitude from its direction.
We present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods.
arXiv Detail & Related papers (2020-02-26T21:42:49Z)
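A minimal sketch of the grafting idea from this entry, assuming the common formulation in which one optimizer supplies the step direction and another supplies the step magnitude (here Adam's direction with SGD's norm; hyperparameters are placeholders):
```python
import numpy as np

def grafted_step(grad, m, v, t, sgd_lr=0.1, adam_lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Combine Adam's update direction with SGD's update magnitude (illustrative);
    t is the 1-based step count used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                               # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    adam_step = adam_lr * m_hat / (np.sqrt(v_hat) + eps)    # supplies the direction
    sgd_step = sgd_lr * grad                                # supplies the magnitude
    direction = adam_step / (np.linalg.norm(adam_step) + eps)
    return np.linalg.norm(sgd_step) * direction, m, v
```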
- On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods [30.084554989542475]
We present a new framework for Adam-type methods that incorporates trend information when updating the parameters with the adaptive step size and gradients.
We show empirically the importance of adding the trend component: our framework consistently outperforms the conventional Adam and AMSGrad methods on classical models with several real-world datasets.
arXiv Detail & Related papers (2020-01-17T01:23:23Z)
- A Dynamic Sampling Adaptive-SGD Method for Machine Learning [8.173034693197351]
We propose a method that adaptively controls the batch size used in the computation of gradient approximations and the step size used to move along such directions.
The proposed method exploits local curvature information and ensures that search directions are descent directions with high probability.
Numerical experiments show that this method is able to choose the best learning rates and compares favorably to fine-tuned SGD for training logistic regression and DNNs.
arXiv Detail & Related papers (2019-12-31T15:36:44Z)