Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
- URL: http://arxiv.org/abs/2510.04988v2
- Date: Thu, 09 Oct 2025 20:53:07 GMT
- Title: Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
- Authors: Kristi Topollai, Anna Choromanska
- Abstract summary: We introduce an adaptive memory mechanism that replaces constant momentum with a dynamic momentum coefficient. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks. Our work opens doors for new ways of inducing adaptivity in optimization.
- Score: 10.325245543844245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an \textit{adaptive memory} mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
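The abstract describes the mechanism only at a high level, so the following is a minimal Python sketch of SGD with a per-step momentum coefficient. The rule used here to set beta_t (agreement between the current gradient and the accumulated memory direction) is an assumed stand-in for the paper's two-plane proximal derivation, and the function and parameter names are illustrative.

```python
import numpy as np

def adaptive_memory_sgd(grad_fn, x0, lr=0.1, steps=100, eps=1e-12):
    """SGD with a dynamic, per-step momentum coefficient beta_t.

    The rule below sets beta_t from the agreement between the current
    gradient and the accumulated memory direction; it is an illustrative
    stand-in, not the paper's two-plane proximal formula.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                      # accumulated memory of past gradients
    for _ in range(steps):
        g = grad_fn(x)
        # agreement in [0, 1]: large when the memory still points "downhill"
        agree = max(0.0, float(m @ g) / (np.linalg.norm(m) * np.linalg.norm(g) + eps))
        beta_t = 0.9 * agree                  # dynamic momentum coefficient
        m = beta_t * m + g                    # heavy-ball style memory update
        x = x - lr * m
    return x

# usage: minimize the convex quadratic f(x) = 0.5 * ||x||^2, whose gradient is x
x_star = adaptive_memory_sgd(lambda x: x, x0=np.ones(5))
```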
Related papers
- Optimizing ML Training with Metagradient Descent [69.89631748402377]
We introduce an algorithm for efficiently calculating metagradients -- gradients through model training -- at scale. We then introduce a "smooth model training" framework that enables effective optimization using metagradients.
arXiv Detail & Related papers (2025-03-17T22:18:24Z) - Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
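For a concrete sense of what such an oracle looks like, here is a minimal sketch of the LMO for an l-infinity-norm ball; Scion's layer-wise norm choices are not reproduced, and the radius and step size are illustrative.

```python
import numpy as np

def lmo_linf_ball(grad, radius=1.0):
    # Linear minimization oracle over the ball {s : ||s||_inf <= radius}:
    # argmin_s <grad, s> = -radius * sign(grad)
    return -radius * np.sign(grad)

# one oracle-driven step on f(x) = ||x||^2 (illustrative radius and step size)
x = np.array([0.5, -2.0, 1.5])
g = 2.0 * x
x = x + 0.1 * lmo_linf_ball(g)
```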
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - Revisiting the Initial Steps in Adaptive Gradient Descent Optimization [6.468625143772815]
Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks. These methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values.
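A minimal sketch of that fix in a NumPy-style Adam loop; the non-zero initial value `v0` is a placeholder, not the paper's recommended setting.

```python
import numpy as np

def adam_nonzero_init(grad_fn, x0, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, v0=1e-3, steps=1000):
    """Adam whose second-moment estimate starts at v0 > 0 instead of 0."""
    x = np.asarray(x0, dtype=float)
    b1, b2 = betas
    m = np.zeros_like(x)
    v = np.full_like(x, v0)                  # non-zero initialization of the 2nd moment
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)            # standard bias correction
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# usage: minimize f(x) = 0.5 * ||x||^2
x_star = adam_nonzero_init(lambda x: x, x0=np.ones(3))
```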
arXiv Detail & Related papers (2024-12-03T04:28:14Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.67982828148859]
We propose a unified training framework for deep neural networks. We introduce three instances of MARS that leverage preconditioned gradient optimization. Results indicate that the implementation of MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level Optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization problem, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge.
By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level.
We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
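To illustrate the memory saving of a row/column (rank-1) factorization, the sketch below applies an Adafactor-style factorization to the squared-gradient statistic of an m x n weight matrix; H-Fac's exact estimators and its momentum factorization are not reproduced here.

```python
import numpy as np

def factored_second_moment(v_row, v_col, grad, beta2=0.999, eps=1e-30):
    """Row/column (rank-1) factorization of the squared-gradient statistic.

    For an m x n weight matrix this keeps O(m + n) numbers instead of
    O(m * n); the reconstruction follows the Adafactor identity
    V_hat = outer(row, col) / mean(row).
    """
    g2 = grad * grad
    v_row = beta2 * v_row + (1 - beta2) * g2.mean(axis=1)   # shape (m,)
    v_col = beta2 * v_col + (1 - beta2) * g2.mean(axis=0)   # shape (n,)
    v_hat = np.outer(v_row, v_col) / (v_row.mean() + eps)   # rank-1 reconstruction, m x n
    return v_row, v_col, v_hat

# usage on the gradient of a 4 x 3 weight matrix
vr, vc = np.zeros(4), np.zeros(3)
vr, vc, vhat = factored_second_moment(vr, vc, np.random.randn(4, 3))
```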
arXiv Detail & Related papers (2024-06-14T12:05:17Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Multiplicative update rules for accelerating deep learning training and increasing robustness [69.90473612073767]
We propose an optimization framework that fits a wide range of machine learning algorithms and enables one to apply alternative update rules.
We claim that the proposed framework accelerates training, while leading to more robust models in contrast to the traditionally used additive update rule.
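To contrast the two rule types, the sketch below shows a standard additive step next to an exponentiated (multiplicative) one for positive parameters; this is a generic example of a multiplicative rule, not necessarily the specific rules proposed in the paper.

```python
import numpy as np

def additive_step(w, grad, lr=0.1):
    # traditional additive update: w <- w - lr * grad
    return w - lr * grad

def multiplicative_step(w, grad, lr=0.1):
    # exponentiated-gradient style update for positive parameters:
    # w <- w * exp(-lr * grad), so each weight is rescaled rather than shifted
    return w * np.exp(-lr * grad)

w = np.array([1.0, 0.5, 2.0])                # positive weights for the multiplicative rule
g = np.array([0.2, -0.1, 0.4])
print(additive_step(w, g), multiplicative_step(w, g))
```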
arXiv Detail & Related papers (2023-07-14T06:44:43Z) - Flatter, faster: scaling momentum for optimal speedup of SGD [0.0]
We study training dynamics arising from the interplay between stochastic gradient descent (SGD), label noise, and momentum in the training of neural networks.
We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization.
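A minimal illustration of that scaling rule, assuming a proportionality constant `c` that the summary does not specify:

```python
def momentum_for_lr(lr, c=1.0):
    # choose beta so that (1 - beta) is proportional to lr ** (2/3);
    # c is an assumed proportionality constant, not fixed by the summary
    return max(0.0, 1.0 - c * lr ** (2.0 / 3.0))

# halving the learning rate shrinks (1 - beta) by a factor of 2 ** (2/3)
print(momentum_for_lr(1e-2), momentum_for_lr(5e-3))
```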
arXiv Detail & Related papers (2022-10-28T20:41:48Z) - Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization [20.782428252187024]
We propose a new adaptive momentum inspired by the optimal choice of the heavy ball momentum for quadratic optimization.
Our proposed adaptive heavy ball momentum can improve stochastic gradient descent (SGD) and Adam.
We verify the efficiency of SGD and Adam with the new adaptive momentum on extensive machine learning benchmarks, including image classification, language modeling, and machine translation.
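For context, the classical closed-form choice for a quadratic objective is shown below (Polyak's heavy ball); the paper adapts this idea online during training, which this textbook sketch does not attempt.

```python
import numpy as np

def optimal_heavy_ball(L, mu):
    """Polyak's optimal heavy-ball hyperparameters for a quadratic whose
    Hessian eigenvalues lie in [mu, L] (the textbook result this line of
    work builds on, not the paper's online estimate)."""
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    lr = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    return lr, beta

# a quadratic with condition number 100
print(optimal_heavy_ball(L=100.0, mu=1.0))
```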
arXiv Detail & Related papers (2021-10-18T07:03:48Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods [30.084554989542475]
We present a new framework for Adam-type methods that incorporates trend information when updating the parameters with adaptive step sizes and gradients.
We show empirically the importance of adding the trend component, where our framework consistently outperforms the conventional Adam and AMSGrad methods on classical models with several real-world datasets.
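The exact trend term is not spelled out in the summary, so the sketch below augments a NumPy Adam loop with a Holt-style trend (an EMA of successive first-moment differences); the coefficient `b3` and the way the trend enters the direction are assumptions for illustration.

```python
import numpy as np

def adam_with_trend(grad_fn, x0, lr=1e-3, b1=0.9, b2=0.999, b3=0.9,
                    eps=1e-8, steps=1000):
    """Adam-type loop with an added Holt-style trend term (an EMA of
    successive first-moment differences); the trend definition and the
    coefficient b3 are assumptions for illustration."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    trend = np.zeros_like(x)
    m_prev = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        trend = b3 * trend + (1 - b3) * (m - m_prev)   # trend of the first moment
        m_prev = m.copy()
        direction = m / (1 - b1 ** t) + trend           # bias-corrected moment plus trend
        x = x - lr * direction / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return x

# usage: minimize f(x) = 0.5 * ||x||^2
x_star = adam_with_trend(lambda x: x, x0=np.ones(3))
```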
arXiv Detail & Related papers (2020-01-17T01:23:23Z)