Online Learning-guided Learning Rate Adaptation via Gradient Alignment
- URL: http://arxiv.org/abs/2506.08419v1
- Date: Tue, 10 Jun 2025 03:46:41 GMT
- Title: Online Learning-guided Learning Rate Adaptation via Gradient Alignment
- Authors: Ruichen Jiang, Ali Kavis, Aryan Mokhtari,
- Abstract summary: The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate. We propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which adjusts the learning rate by tracking the alignment between consecutive gradients and a local curvature estimate. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule.
- Score: 25.688764889273237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate for normalized SGD equipped with GALA in the smooth, nonconvex setting. Empirically, common optimizers such as SGD and Adam, when augmented with GALA, demonstrate robust performance across a wide range of initial learning rates and perform competitively without the need for tuning.
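To make the mechanism concrete, here is a minimal NumPy sketch of how an alignment-driven, FTRL-style step-size rule of this flavour could be wired into normalized SGD. The function name, the exact feedback signal, the secant-style curvature estimate, and all constants (`eta0`, `reg`, the clipping interval) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


def normalized_sgd_with_gala_like_lr(grad_fn, x0, steps=1000, eta0=1e-3,
                                     reg=100.0, eta_min=1e-6, eta_max=1.0):
    """Normalized SGD with a step size adapted from gradient alignment.

    Rough sketch in the spirit of GALA (not the authors' algorithm): the step
    size is the decision variable of a 1-D online learning problem whose
    feedback is the alignment between consecutive gradients, discounted by a
    crude curvature estimate. All constants are illustrative assumptions.
    """
    x = np.asarray(x0, dtype=float).copy()
    eta = eta0
    prev_grad, prev_step = None, None
    cum_align = 0.0  # cumulative 1-D feedback driving the FTRL-style update

    for _ in range(steps):
        g = np.asarray(grad_fn(x), dtype=float)
        g_norm = np.linalg.norm(g) + 1e-12

        if prev_grad is not None:
            # Secant-style curvature estimate along the previous step.
            curv = np.linalg.norm(g - prev_grad) / (np.linalg.norm(prev_step) + 1e-12)
            # Normalized alignment of consecutive gradients: positive when they
            # point the same way, negative when they conflict.
            align = float(np.dot(g, prev_grad)) / (g_norm * (np.linalg.norm(prev_grad) + 1e-12))
            cum_align += align / (curv + 1e-12)
            # FTRL with a quadratic regularizer over [eta_min, eta_max] reduces
            # to shifting eta0 by the scaled cumulative feedback and projecting.
            eta = float(np.clip(eta0 + cum_align / reg, eta_min, eta_max))

        step = -eta * g / g_norm  # normalized SGD step
        x = x + step
        prev_grad, prev_step = g, step

    return x, eta
```

With a quadratic regularizer, the FTRL step reduces to a clipped running sum of alignment feedback, which matches the qualitative behaviour described in the abstract: the step size drifts upward while consecutive gradients stay aligned and shrinks once they start to conflict.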
Related papers
- Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate [0.0]
We introduce Hindsight-Guided Momentum, a first-order optimization algorithm that adaptively scales learning rates based on recent updates. HGM uses a hindsight mechanism that increases the learning rate when recent update directions are coherent with the current gradient and decreases it when they conflict.
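Read literally, a hindsight mechanism of this kind can be sketched as a multiplicative step-size factor driven by the cosine between the current gradient and a momentum buffer of past updates. The snippet below is an assumed illustration, not HGM's actual update rule; the scaling bounds `lo` and `hi` are hypothetical.

```python
import numpy as np


def hindsight_scaled_lr(grad, momentum, base_lr, lo=0.1, hi=2.0, eps=1e-12):
    """Illustrative hindsight-style scaling (assumed, not HGM's exact rule).

    The step size is amplified when the current gradient agrees with the
    running momentum of past updates (coherent directions) and damped when
    they point in conflicting directions.
    """
    cos = float(np.dot(grad, momentum)) / (
        (np.linalg.norm(grad) + eps) * (np.linalg.norm(momentum) + eps))
    # Map the cosine in [-1, 1] to a multiplicative factor in [lo, hi].
    factor = lo + (hi - lo) * (cos + 1.0) / 2.0
    return base_lr * factor
```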
arXiv Detail & Related papers (2025-06-22T08:02:19Z)
- AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent [58.05410015124021]
We introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
arXiv Detail & Related papers (2025-05-27T18:25:21Z)
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with the variation in the gradients of the online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition that allows the smoothness constant to grow with the gradient norm.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
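For reference, the two notions used above are typically defined as follows in the online learning literature; these are standard formulations and are not taken from this particular paper.

```latex
% Gradient variation of the online functions f_1, \dots, f_T over a domain \mathcal{X}:
V_T \;=\; \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \bigl\| \nabla f_t(x) - \nabla f_{t-1}(x) \bigr\|^2 .

% Generalized (L_0, L_1)-smoothness: the local smoothness may grow with the gradient norm,
\| \nabla^2 f(x) \| \;\le\; L_0 + L_1 \, \| \nabla f(x) \| .
```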
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as its accelerated and adaptive variants such as Adam, fails to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z)
- The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms [8.681909776958184]
We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems.
We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs.
We investigate in detail two adaptive learning rate rules on the least squares problem: an idealized exact line search and AdaGrad-Norm.
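For reference, the AdaGrad-Norm rule mentioned above is commonly written in the following standard form; constant factors may differ from the paper's setup.

```latex
b_{t+1}^2 \;=\; b_t^2 + \bigl\| \nabla f(x_t) \bigr\|^2,
\qquad
x_{t+1} \;=\; x_t - \frac{\eta}{b_{t+1}} \, \nabla f(x_t).
```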
arXiv Detail & Related papers (2024-05-30T00:27:52Z)
- On discretisation drift and smoothness regularisation in neural network training [0.0]
We aim to take steps towards an improved understanding of deep learning, with a focus on optimisation and model regularisation.
We start by investigating gradient descent (GD), a discrete-time algorithm at the basis of most popular deep learning optimisation algorithms.
We derive novel continuous-time flows that account for discretisation drift. Unlike the negative gradient flow (NGF), these new flows can be used to describe learning-rate-specific behaviours of GD, such as training instabilities observed in supervised learning and two-player games.
We then translate insights from continuous time into mitigation strategies for unstable GD dynamics by constructing novel learning rate schedules and regularisers.
arXiv Detail & Related papers (2023-10-21T15:21:36Z)
- FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specific, auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter has changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
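As an assumed illustration of this kind of parameter-wise alignment rule (not AdaRem's actual update), a sign-agreement rescaling can be sketched as follows; the names, the EMA tracking of past changes, and the constants `beta` and `strength` are hypothetical.

```python
import numpy as np


def alignment_rescaled_step(grad, change_ema, base_lr, beta=0.9, strength=0.5):
    """Rough sketch of a parameter-wise alignment rescaling (assumed form).

    change_ema tracks the average direction each parameter has moved recently;
    coordinates whose current descent direction (-grad) agrees with that
    history get a larger step, conflicting coordinates get a smaller one.
    """
    agree = np.sign(change_ema) * np.sign(-grad)         # +1 aligned, -1 conflicting, 0 neutral
    per_param_lr = base_lr * (1.0 + strength * agree)    # in [base_lr*(1-strength), base_lr*(1+strength)]
    step = -per_param_lr * grad
    change_ema = beta * change_ema + (1.0 - beta) * step  # update the direction history
    return step, change_ema
```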
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems [91.43582419264763]
We study the problem of system identification and adaptive control in partially observable linear dynamical systems.
We present the first model estimation method with finite-time guarantees in both open and closed-loop system identification.
We show that AdaptOn is the first algorithm that achieves $\text{polylog}(T)$ regret in adaptive control of unknown partially observable linear dynamical systems.
arXiv Detail & Related papers (2020-03-25T06:00:33Z)