CAdam: Confidence-Based Optimization for Online Learning
- URL: http://arxiv.org/abs/2411.19647v2
- Date: Wed, 04 Jun 2025 14:58:21 GMT
- Title: CAdam: Confidence-Based Optimization for Online Learning
- Authors: Shaowen Wang, Anan Liu, Jian Xiao, Huan Liu, Yuekui Yang, Cong Xu, Qianqian Pu, Suncong Zheng, Wei Zhang, Di Wang, Jie Jiang, Jian Li
- Abstract summary: We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. In large-scale A/B testing, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).
- Score: 41.022196390765714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum ($m_t$) and adaptive learning rate ($v_t$). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noise, poses significant challenges to Adam's standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam's performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam's original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between the true distributional shifts and mere noise, and to adapt more quickly to new data distributions. In various settings with distribution shift or noise, our experiments demonstrate that CAdam surpasses other well-known optimizers, including the original Adam. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system's gross merchandise volume (GMV).
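The per-dimension consistency check described in the abstract can be sketched as follows. This is a minimal NumPy illustration inferred from the abstract, not the authors' code: the function name, the sign-agreement test `m * grad > 0`, and the choice to keep updating the moment estimates while withholding the parameter update are all assumptions.

```python
import numpy as np

def cadam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One CAdam-style step (sketch): apply Adam's update only in the
    dimensions where momentum and the current gradient agree in sign;
    withhold the update elsewhere while the moments keep tracking."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # Adam bias correction
    v_hat = v / (1 - beta2 ** t)
    confident = (m * grad) > 0            # momentum and gradient in sync
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    param = param - np.where(confident, update, 0.0)  # withhold when not
    return param, m, v
```

On a noisy dimension the sign-agreement mask flips often and updates are mostly withheld; under a genuine distribution shift the momentum soon realigns with the new gradient direction and updates resume, which matches the abstract's distinction between noise and true shifts.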
Related papers
- AdamNX: An Adam improvement algorithm based on a novel exponential decay mechanism for the second-order moment estimate [13.40796672049436]
The AdamNX algorithm is proposed to drive high-dimensional optimization toward local and even global minima. Its core innovation is a novel exponential decay rate for the second-order moment estimate. Results show that this decay rate outperforms the one used in current second-order moment estimation.
arXiv Detail & Related papers (2025-11-17T15:07:55Z) - In Search of Adam's Secret Sauce [11.215133680044005]
We train over 1,300 language models across different data configurations and scales. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam. We show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients.
arXiv Detail & Related papers (2025-05-27T23:30:18Z) - AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
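The denominator described above can be sketched in a few lines; the fixed weighting `c` and the function name are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def adams_step(param, grad, m, lr=1e-3, beta1=0.9, c=0.9, eps=1e-8):
    """AdamS-style step (sketch): normalize by the root of a weighted sum
    of squared momentum and squared current gradient, so no second-moment
    buffer v_t is needed (only m is carried between steps)."""
    m = beta1 * m + (1 - beta1) * grad
    denom = np.sqrt(c * m ** 2 + (1 - c) * grad ** 2) + eps
    return param - lr * m / denom, m
```

Note the memory claim follows directly: the only persistent state is `m`, the same footprint as SGD with momentum.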
arXiv Detail & Related papers (2025-05-22T08:16:48Z) - Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning.
We propose Adam-Rel, which uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes.
We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
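The timestep reset described above can be illustrated with a small helper; the function name and the `epoch_length` parameter are assumptions for illustration:

```python
def relative_timestep(global_step, epoch_length):
    """Adam-Rel idea (sketch): feed Adam's bias correction the step index
    within the current epoch/target period, so the correction restarts
    whenever the target changes (1-indexed, as Adam's t usually is)."""
    return (global_step - 1) % epoch_length + 1
```

The effect is that Adam's bias-correction terms behave as if optimization had just begun after each target change, rather than carrying stale state across the nonstationarity.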
arXiv Detail & Related papers (2024-12-22T18:01:08Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - AdamL: A fast adaptive gradient method incorporating loss function [1.6025685183216696]
We propose AdamL, a novel variant of Adam that takes loss function information into account to attain better results.
We show that AdamL achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief.
In the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants and does not require manual adjustment of the learning rate during the later stage of training.
arXiv Detail & Related papers (2023-12-23T16:32:29Z) - StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling [0.0]
We introduce StochGradAdam, a novel extension of the Adam algorithm, incorporating gradient sampling techniques.
StochGradAdam achieves comparable or superior performance to Adam, even when using fewer gradient updates per iteration.
The results suggest that this approach is particularly effective for large-scale models and datasets.
arXiv Detail & Related papers (2023-10-25T22:45:31Z) - Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$+$ (pronounced as Adam-plus)
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods [30.084554989542475]
We present a new framework for Adam-type methods with the trend information when updating the parameters with the adaptive step size and gradients.
We show empirically the importance of adding the trend component, where our framework consistently outperforms the conventional Adam and AMSGrad methods on classical models with several real-world datasets.
arXiv Detail & Related papers (2020-01-17T01:23:23Z)