Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate
- URL: http://arxiv.org/abs/2506.22479v1
- Date: Sun, 22 Jun 2025 08:02:19 GMT
- Title: Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate
- Authors: Krisanu Sarkar
- Abstract summary: We introduce Hindsight-Guided Momentum (HGM), a first-order optimization algorithm that adaptively scales learning rates based on the directional consistency of recent updates. HGM uses a hindsight mechanism that distinguishes between coherent and conflicting gradient directions, increasing the learning rate when updates align and reducing it when they conflict.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Hindsight-Guided Momentum (HGM), a first-order optimization algorithm that adaptively scales learning rates based on the directional consistency of recent updates. Traditional adaptive methods, such as Adam or RMSprop, adapt learning dynamics using only the magnitude of gradients, often overlooking important geometric cues. Geometric cues refer to directional information, such as the alignment between current gradients and past updates, which reflects the local curvature and consistency of the optimization path. HGM addresses this by incorporating a hindsight mechanism that evaluates the cosine similarity between the current gradient and accumulated momentum. This allows it to distinguish between coherent and conflicting gradient directions, increasing the learning rate when updates align and reducing it in regions of oscillation or noise. The result is a more responsive optimizer that accelerates convergence in smooth regions of the loss surface while maintaining stability in sharper or more erratic areas. Despite this added adaptability, the method preserves the computational and memory efficiency of existing optimizers. By more intelligently responding to the structure of the optimization landscape, HGM provides a simple yet effective improvement over existing approaches, particularly in non-convex settings like that of deep neural network training.
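The abstract describes the core mechanism clearly enough to sketch. Below is a minimal, hedged illustration in plain NumPy of how a cosine-similarity "hindsight" signal could modulate a momentum update; the function name `hgm_step`, the constant `beta`, and the particular scaling rule `1 + 0.5 * cos_sim` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def hgm_step(params, grad, momentum, lr=1e-3, beta=0.9, eps=1e-8):
    """One hypothetical HGM-style update on a flat parameter vector."""
    # Accumulate momentum as in classical momentum methods.
    momentum = beta * momentum + (1.0 - beta) * grad

    # Hindsight signal: cosine similarity between the current gradient
    # and the accumulated momentum.
    cos_sim = grad @ momentum / (np.linalg.norm(grad) * np.linalg.norm(momentum) + eps)

    # Map the similarity in [-1, 1] to a multiplicative factor:
    # coherent directions speed the step up, conflicting directions slow it down.
    scale = 1.0 + 0.5 * cos_sim

    params = params - lr * scale * momentum
    return params, momentum
```

In a training loop this would be called once per mini-batch, e.g. `w, m = hgm_step(w, g, m)` with `g` the current gradient. The abstract notes that HGM keeps the computational and memory cost of existing optimizers, which this sketch respects (one extra dot product and two norms per step), but it does not specify how the mechanism combines with per-parameter second-moment scaling as in Adam.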
Related papers
- Online Learning-guided Learning Rate Adaptation via Gradient Alignment [25.688764889273237]
The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate. We propose a principled framework called GALA (Gradient Alignment-based Adaptation), which adjusts the learning rate by tracking the alignment between consecutive gradients and a local curvature estimate. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning-rate schedule.
arXiv Detail & Related papers (2025-06-10T03:46:41Z) - AYLA: Amplifying Gradient Sensitivity via Loss Transformation in Non-Convex Optimization [0.0]
Stochastic Gradient Descent (SGD) and its variants, such as ADAM, are foundational to deep learning optimization. This paper introduces AYLA, a novel framework that enhances training dynamics through loss transformation.
arXiv Detail & Related papers (2025-04-02T16:31:39Z) - Understanding Optimization in Deep Learning with Central Flows [53.66160508990508]
We show that an optimizer's implicit behavior (e.g., RMSProp's) can be explicitly captured by a "central flow": a differential equation.
We show that these flows can empirically predict long-term optimization trajectories of generic neural networks.
arXiv Detail & Related papers (2024-10-31T17:58:13Z) - Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Adaptive Friction in Deep Learning: Enhancing Optimizers with Sigmoid and Tanh Function [0.0]
We introduce sigSignGrad and tanhSignGrad, two novel optimizers that integrate adaptive friction coefficients based on the Sigmoid and Tanh functions.
Our theoretical analysis demonstrates the wide-ranging adjustment capability of the friction coefficient S.
Experiments on CIFAR-10 and Mini-ImageNet using ResNet50 and ViT architectures confirm the superior performance of our proposed optimizers.
arXiv Detail & Related papers (2024-08-07T03:20:46Z) - Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks [5.507301894089302]
This paper is the first attempt to study a new optimization technique for deep neural networks that uses the sum normalization of a gradient vector as coefficients.
The proposed technique is hence named adaptive gradient regularization (AGR).
arXiv Detail & Related papers (2024-07-24T02:23:18Z) - Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy [75.15685966213832]
We analyze the rich directional structure of optimization trajectories represented by their pointwise parameters.
We show that training only the scalar batchnorm parameters, starting partway into training, matches the performance of training the entire network.
arXiv Detail & Related papers (2024-03-12T07:32:47Z) - Signal Processing Meets SGD: From Momentum to Filter [6.751292200515355]
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization. In this paper, we analyze gradient behavior through a signal-processing lens, isolating key factors that influence updates. We introduce a novel method, SGDF, based on Wiener filter principles, which derives an optimal time-varying gain to refine updates.
arXiv Detail & Related papers (2023-11-06T01:41:46Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs well on vision tasks and achieves state-of-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient (a simplified sketch of this idea appears after this list).
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
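Several of the papers above steer the step size with directional agreement. As a concrete illustration of the parameter-wise variant described for AdaRem, here is a deliberately simplified NumPy sketch; `adarem_like_step`, `beta`, `gamma`, and the EMA-of-signs construction are assumptions made for illustration, not the authors' published update rule.

```python
import numpy as np

def adarem_like_step(params, grad, direction_ema, lr=1e-2, beta=0.9, gamma=0.5):
    """One illustrative parameter-wise update; direction_ema tracks past update directions."""
    # Per-parameter alignment in [-1, 1] between past update directions
    # (tracked as an EMA of signs) and the current descent direction.
    alignment = direction_ema * np.sign(-grad)

    # Learning-rate factor: above 1 where directions agree, below 1 where they conflict.
    factor = 1.0 + gamma * alignment

    # Take the scaled gradient step.
    params = params - lr * factor * grad

    # Update the tracked direction with the sign of the step just taken.
    direction_ema = beta * direction_ema + (1.0 - beta) * np.sign(-grad)
    return params, direction_ema
```

The same alignment signal, computed against accumulated momentum rather than a per-parameter sign average, is essentially what the HGM abstract above describes.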