EAdam Optimizer: How $\epsilon$ Impact Adam
- URL: http://arxiv.org/abs/2011.02150v1
- Date: Wed, 4 Nov 2020 06:39:44 GMT
- Title: EAdam Optimizer: How $\epsilon$ Impact Adam
- Authors: Wei Yuan and Kai-Xin Gao
- Abstract summary: We discuss the impact of the constant $\epsilon$ for Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method can bring significant improvement compared with Adam.
- Score: 7.0552555621312605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many adaptive optimization methods have been proposed and used in deep
learning, in which Adam is regarded as the default algorithm and widely used in
many deep learning frameworks. Recently, many variants of Adam, such as
Adabound, RAdam and Adabelief, have been proposed and show better performance
than Adam. However, these variants mainly focus on changing the stepsize by
making differences on the gradient or the square of it. Motivated by the fact
that suitable damping is important for the success of powerful second-order
optimizers, we discuss the impact of the constant $\epsilon$ for Adam in this
paper. Surprisingly, we can obtain better performance than Adam simply by changing
the position of $\epsilon$. Based on this finding, we propose a new variant of
Adam called EAdam, which doesn't need extra hyper-parameters or computational
costs. We also discuss the relationships and differences between our method and
Adam. Finally, we conduct extensive experiments on various popular tasks and
models. Experimental results show that our method can bring significant
improvement compared with Adam. Our code is available at
https://github.com/yuanwei2019/EAdam-optimizer.
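As a concrete illustration of the abstract's central point, the sketch below contrasts where $\epsilon$ enters the two updates: standard Adam adds $\epsilon$ to $\sqrt{\hat{v}_t}$ in the denominator, while EAdam, as described in the paper, folds $\epsilon$ into the second-moment accumulation at every step so that it is carried through the exponential moving average. This is a minimal NumPy sketch under our own assumptions about bias correction and the toy objective, not the authors' implementation; the official PyTorch optimizer is in the repository linked above.

```python
# Minimal sketch of the epsilon placement discussed in the paper.
# The function names and the toy problem are illustrative assumptions;
# see the linked repository for the official PyTorch optimizer.
import numpy as np


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam: epsilon is added to sqrt(v_hat) in the denominator."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # epsilon enters here
    return theta, m, v


def eadam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """EAdam-style step (sketch): epsilon is added to the second-moment
    accumulator at every iteration, so it is carried through the EMA
    instead of being added once after the square root."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2 + eps             # epsilon moved here
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / np.sqrt(v_hat)
    return theta, m, v


# Toy comparison: minimize f(theta) = 0.5 * theta**2 from the same start.
for step_fn in (adam_step, eadam_step):
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, 1001):
        grad = theta                   # d/dtheta of 0.5 * theta**2
        theta, m, v = step_fn(theta, grad, m, v, t, lr=1e-2)
    print(f"{step_fn.__name__}: theta after 1000 steps = {theta:.6f}")
```

On this toy quadratic both variants converge; the paper's argument is that accumulating $\epsilon$ inside $v_t$ acts roughly like a larger damping term under the square root, which is where the reported gains over Adam are attributed, at no extra hyper-parameter or computational cost.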
Related papers
- Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning.
We propose Adam-Rel, which uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes.
We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
arXiv Detail & Related papers (2024-12-22T18:01:08Z)
- CAdam: Confidence-Based Optimization for Online Learning [35.84013976735154]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates.
Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers.
In large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam.
arXiv Detail & Related papers (2024-11-29T12:00:27Z)
- Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling.
Our findings indicate that, except for SGD, these algorithms all perform comparably, both in their optimal performance and in how they fare across a wide range of hyperparameter choices.
arXiv Detail & Related papers (2024-07-10T18:11:40Z)
- Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
- Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
AdamW is a variant of Adam that decouples the weight decay from the gradient-based update, in contrast to Adam with $\ell_2$ regularization (Adam-$\ell_2$), which folds the regularizer into the gradient.
We show a correlation between the advantage of AdamW over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z)
- A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
- Effectiveness of Optimization Algorithms in Deep Image Classification [6.368679897630892]
Two new Adam variants, AdaBelief and Padam, have recently been introduced in the community.
We analyze these two variants and compare them with conventional optimizers (Adam, SGD + Momentum) in the scenario of image classification.
We evaluate the performance of these optimization algorithms on AlexNet and on simplified versions of VGGNet and ResNet, using the EMNIST dataset.
arXiv Detail & Related papers (2021-10-04T17:50:51Z)
- Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.