AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate
- URL: http://arxiv.org/abs/2511.13465v3
- Date: Thu, 20 Nov 2025 05:55:08 GMT
- Title: AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate
- Authors: Meng Zhu, Quan Xiao, Weidong Min
- Abstract summary: The AdamNX algorithm is proposed to stably converge high-dimensional optimization problems to local and even global minima.
Its core innovation is a novel exponential decay rate for the second-order moment estimate.
Results show that the proposed decay rate outperforms the one currently used for the second-order moment estimate.
- Score: 13.40796672049436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the beginning of the 21st century, artificial intelligence has been driving a new round of industrial revolution. Within the training framework, the optimization algorithm aims to stably converge high-dimensional optimization problems to local and even global minima. Although the scale of model parameters and data has grown in the era of large language models, Adam remains the mainstream optimization algorithm. However, compared with optimization algorithms based on stochastic gradient descent (SGD), Adam is more likely to converge to non-flat minima. To address this issue, we propose the AdamNX algorithm. Its core innovation is a novel exponential decay rate for the second-order moment estimate, which gradually weakens the step-size correction as training progresses and degrades to momentum SGD in the stable training period, thereby improving training stability in that period and potentially enhancing generalization. Experimental results show that the proposed exponential decay rate outperforms the one currently used for the second-order moment estimate, and that AdamNX stably outperforms Adam and its variants. Our code is open-sourced at https://github.com/mengzhu0308/AdamNX.
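The abstract describes the mechanism but not its exact schedule, so the following is only a minimal sketch of the stated idea, assuming a linear anneal: an Adam-style update whose second-moment preconditioning is faded out over training so that the step degrades to momentum SGD. The schedule `rho`, the hyperparameter names, and the interpolation form are assumptions, not the paper's definitions; see the linked repository for the authors' implementation.

```python
import numpy as np

def adamnx_like_step(theta, grad, m, v, t, T,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical AdamNX-style step (sketch, not the authors' code).

    Interpolates between an Adam update (early training) and momentum SGD
    (late training) by annealing the influence of the second-moment term.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)

    rho = max(0.0, 1.0 - t / T)               # assumed linear anneal: 1 -> 0
    # rho = 1: Adam-style preconditioning; rho = 0: plain momentum SGD.
    denom = rho * np.sqrt(v_hat) + (1.0 - rho) + eps
    theta = theta - lr * m_hat / denom
    return theta, m, v
```

With `rho = 1` throughout this is bias-corrected Adam; with `rho = 0` it is SGD with bias-corrected momentum.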
Related papers
- A Physics-Inspired Optimizer: Velocity Regularized Adam [9.38448580878081]
We introduce Velocity-Regularized Adam (VRAdam) for training deep neural networks.
VRAdam adds a higher-order penalty on the learning rate based on the velocity.
We demonstrate that VRAdam outperforms standard optimizers, including AdamW.
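The summary names the mechanism but not its form; below is a minimal sketch of one plausible reading, in which the effective step size is divided by a higher-order penalty that grows with the squared norm of the optimizer's velocity (momentum buffer). The penalty form, `lam`, and `power` are assumptions.

```python
import numpy as np

def vradam_like_lr(base_lr, velocity, lam=0.1, power=2):
    """Hypothetical velocity-regularized learning rate (sketch).

    Shrinks the step size when the momentum buffer is large,
    mimicking a kinetic-energy-style penalty on fast motion.
    """
    speed_sq = float(np.sum(velocity**2))
    return base_lr / (1.0 + lam * speed_sq**(power / 2))
```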
arXiv Detail & Related papers (2025-05-19T14:51:40Z)
- CAdam: Confidence-Based Optimization for Online Learning [41.022196390765714]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates.
In large-scale A/B testing, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).
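A minimal sketch of the stated per-dimension consistency check, assuming "consistency" means sign agreement between the momentum buffer and the current gradient; coordinates that disagree are skipped for this step. The masking rule below is an assumed reading, not the paper's exact criterion.

```python
import numpy as np

def cadam_like_step(theta, grad, m, v, t,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical confidence-masked Adam step (sketch)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Update only coordinates where momentum and gradient agree in sign.
    confident = np.sign(m) == np.sign(grad)
    theta = theta - lr * confident * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```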
arXiv Detail & Related papers (2024-11-29T12:00:27Z)
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.67982828148859]
We propose a unified optimization framework for training deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient updates.
Results indicate that MARS consistently outperforms Adam.
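The summary does not spell out how variance reduction enters; below is a sketch of a generic recursive variance-reduced gradient correction of the kind such frameworks build on, where the current gradient is adjusted by the gradient change between consecutive iterates evaluated on the same batch. The scaling `gamma`, the momentum-based factor, and the norm clipping are assumptions.

```python
import numpy as np

def variance_reduced_grad(grad_curr, grad_prev_same_batch,
                          gamma=0.025, beta1=0.95):
    """Sketch of a variance-reduced gradient correction (assumed form).

    grad_curr:            gradient at the current iterate on batch B_t
    grad_prev_same_batch: gradient at the previous iterate on the SAME batch
    """
    correction = gamma * (beta1 / (1 - beta1)) * (grad_curr - grad_prev_same_batch)
    c = grad_curr + correction
    # Clipping keeps the corrected gradient well-scaled (assumed safeguard).
    norm = np.linalg.norm(c)
    return c if norm <= 1.0 else c / norm
```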
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization [90.08459757321405]
Federated Adam (FedAdam) algorithms suffer from a threefold increase in uplink communication overhead.
We propose a novel sparse FedAdam algorithm called FedAdam-SSM, wherein distributed devices sparsify the updates of local model parameters and moment estimates with a shared sparse mask (SSM).
By minimizing the divergence bound between the model trained by FedAdam-SSM and centralized Adam, we optimize the SSM to mitigate the learning performance degradation caused by sparsification error.
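A minimal sketch of the shared-mask idea, assuming a top-k magnitude criterion on the model update (flattened to a vector): one index set is chosen and applied to the model update and both moment estimates, so a single sparse pattern is uploaded instead of three dense tensors. The mask criterion is an assumption.

```python
import numpy as np

def shared_sparse_mask(delta_theta, m, v, k):
    """Sketch: one shared top-k mask applied to update and moments.

    Assumes 1-D (flattened) tensors. Transmitting a single index set
    for all three tensors avoids tripling the uplink cost.
    """
    idx = np.argsort(np.abs(delta_theta))[-k:]   # assumed mask criterion
    mask = np.zeros_like(delta_theta)
    mask[idx] = 1.0
    return delta_theta * mask, m * mask, v * mask, idx
```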
arXiv Detail & Related papers (2024-05-28T07:56:49Z)
- Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning [2.695991050833627]
We propose a new optimization algorithm named CG-like-Adam for deep learning.
Specifically, both the first-order and the second-order moment estimates of generic Adam are replaced by conjugate-gradient-like counterparts.
Numerical experiments on the CIFAR-10/100 datasets show the superiority of the proposed algorithm.
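For orientation, a sketch of a conjugate-gradient-like search direction using the standard Fletcher-Reeves coefficient; how the paper blends such directions into Adam's two moment estimates is not stated in the summary, so only the bare direction is shown.

```python
import numpy as np

def cg_like_direction(grad, grad_prev, d_prev, eps=1e-12):
    """Conjugate-gradient-like direction (Fletcher-Reeves coefficient)."""
    beta_fr = np.dot(grad, grad) / (np.dot(grad_prev, grad_prev) + eps)
    return -grad + beta_fr * d_prev
```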
arXiv Detail & Related papers (2024-04-02T07:57:17Z)
- Improving the Adaptive Moment Estimation (ADAM) stochastic optimizer through an Implicit-Explicit (IMEX) time-stepping approach [1.2233362977312945]
The classical Adam algorithm is a first-order implicit-explicit (IMEX) discretization of the underlying ODE.
We propose new extensions of the Adam scheme obtained by using higher-order IMEX methods to solve the ODE.
We derive a new optimization algorithm for neural network training that performs better than classical Adam on several regression and classification problems.
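To make the ODE viewpoint concrete, a commonly used continuous-time limit of Adam is sketched below; the exact system and implicit-explicit splitting analyzed in the paper may differ, so treat this as illustrative.

```latex
% Illustrative continuous-time limit of Adam (schematic, not the paper's exact system):
\begin{aligned}
\dot{m}(t) &= a\,\bigl(\nabla f(\theta(t)) - m(t)\bigr), \\
\dot{v}(t) &= b\,\bigl(\nabla f(\theta(t))^{2} - v(t)\bigr), \\
\dot{\theta}(t) &= -\alpha\,\frac{m(t)}{\sqrt{v(t)} + \epsilon}.
\end{aligned}
```

Treating some terms of such a system implicitly and others explicitly in a one-step discretization recovers an Adam-like update; the proposed extensions apply higher-order IMEX schemes to the same system.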
arXiv Detail & Related papers (2024-03-20T16:08:27Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
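A minimal sketch of the stated principle: per coordinate, choose the mixing weight from a small candidate grid that maximizes the estimated gradient variance (second moment minus squared running mean). The candidate grid and bookkeeping are assumptions; `m` is a running mean of gradients maintained elsewhere.

```python
import numpy as np

def maxva_like_second_moment(g, m, v_prev, betas=(0.5, 0.9, 0.98, 0.999)):
    """Sketch: per-coordinate beta2 chosen to maximize estimated variance."""
    best_v = v_prev.copy()
    best_var = np.full_like(g, -np.inf)
    for b in betas:
        v_cand = b * v_prev + (1 - b) * g**2   # candidate weighted mean
        var_est = v_cand - m**2                # variance estimate per coord
        better = var_est > best_var
        best_v = np.where(better, v_cand, best_v)
        best_var = np.where(better, var_est, best_var)
    return best_v
```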
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization with deep neural networks for large-scale data.
Our algorithm requires far fewer communication rounds while retaining theoretical guarantees.
Experiments on several benchmark datasets demonstrate its effectiveness and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
- AdaX: Adaptive Gradient Descent with Exponential Long Term Memory [34.6432726391469]
We identify a problem of Adam by analyzing its performance on simple machine learning tasks.
We propose a novel adaptive gradient descent algorithm named AdaX to solve the problem.
AdaX outperforms Adam in various computer vision and natural language processing tasks.
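A sketch of the long-term-memory second moment as commonly described for AdaX: squared gradients are accumulated rather than exponentially forgotten, with a matching bias correction so the scale stays comparable to Adam's. The hyperparameter value is an assumption.

```python
import numpy as np

def adax_like_second_moment(g, v_prev, t, beta2=1e-4):
    """Sketch of an AdaX-style accumulating second moment.

    Unlike Adam's v = b*v + (1-b)*g^2, past information is never
    exponentially forgotten; the correction keeps the scale bounded.
    """
    v = (1 + beta2) * v_prev + beta2 * g**2
    v_hat = v / ((1 + beta2)**t - 1)   # bias correction (t >= 1)
    return v, v_hat
```

At t = 1 with `v_prev = 0`, the correction gives `v_hat = g**2`, matching Adam's bias-corrected first step.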
arXiv Detail & Related papers (2020-04-21T03:46:58Z)