Divergence Results and Convergence of a Variance Reduced Version of ADAM
- URL: http://arxiv.org/abs/2210.05607v1
- Date: Tue, 11 Oct 2022 16:54:56 GMT
- Title: Divergence Results and Convergence of a Variance Reduced Version of ADAM
- Authors: Ruiqi Wang and Diego Klabjan
- Abstract summary: Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which indicates that it is the variance of the gradients that causes the divergence of the original ADAM.
Numerical experiments show that the proposed algorithm performs as well as ADAM.
- Score: 30.10316505009956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic optimization algorithms using exponential moving averages of the past gradients, such as ADAM, RMSProp and AdaGrad, have had great success in many applications, especially in training deep neural networks. ADAM in particular stands out as efficient and robust. Despite its outstanding performance, ADAM has been proven to diverge on some specific problems. We revisit the divergence question and provide divergence examples under stronger conditions, such as in expectation or with high probability. Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which means that it is the variance of the gradients that causes the divergence of the original ADAM. To this end, we propose a variance reduced version of ADAM and provide a convergence analysis of the algorithm. Numerical experiments show that the proposed algorithm performs as well as ADAM. Our work suggests a new direction for fixing the convergence issues.
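As a rough illustration only (the paper's actual variance reduced ADAM is not spelled out in this listing), the sketch below plugs an SVRG-style gradient estimator into a standard ADAM update on a toy least-squares problem. The function names, the anchor-point schedule, and the toy data are assumptions made for the example, not the authors' construction.

```python
import numpy as np

def svrg_gradient(grad_fn, w, w_anchor, full_grad_anchor, batch):
    """SVRG-style estimate: stochastic gradient at w, corrected by the difference
    between the stochastic and full gradient at an anchor point. This reduces the
    variance of the estimate without changing its mean."""
    return grad_fn(w, batch) - grad_fn(w_anchor, batch) + full_grad_anchor

def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard ADAM update applied to the (variance-reduced) gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: least-squares loss 0.5 * ||X w - y||^2 on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=256)

def grad_fn(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

w = np.zeros(10)
m, v, t = np.zeros(10), np.zeros(10), 0
for epoch in range(20):
    w_anchor = w.copy()
    full_grad_anchor = grad_fn(w_anchor, np.arange(len(y)))  # full gradient at the anchor
    for _ in range(16):
        t += 1
        batch = rng.choice(len(y), size=16, replace=False)
        g = svrg_gradient(grad_fn, w, w_anchor, full_grad_anchor, batch)
        w, m, v = adam_step(w, m, v, g, t)
```

The intended takeaway is only that the estimator fed to ADAM is unbiased with reduced variance, which is the kind of variance reduction assumption the abstract says the convergence result relies on.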
Related papers
- Covariance-Adaptive Sequential Black-box Optimization for Diffusion Targeted Generation [60.41803046775034]
We show how to perform user-preferred targeted generation via diffusion models using only black-box target scores from users.
Experiments on both numerical test problems and target-guided 3D-molecule generation tasks show the superior performance of our method in achieving better target scores.
arXiv Detail & Related papers (2024-06-02T17:26:27Z) - AA-DLADMM: An Accelerated ADMM-based Framework for Training Deep Neural
Networks [1.3812010983144802]
Stochastic gradient descent (SGD) and its many variants are the most widespread optimization algorithms for training deep neural networks.
SGD suffers from inevitable drawbacks, including vanishing gradients, lack of theoretical guarantees, and substantial sensitivity to input.
This paper proposes an Anderson Acceleration for Deep Learning ADMM (AA-DLADMM) algorithm to tackle this drawback.
arXiv Detail & Related papers (2024-01-08T01:22:00Z) - Moreau Envelope ADMM for Decentralized Weakly Convex Optimization [55.2289666758254]
This paper proposes a proximal variant of the alternating direction method of multipliers (ADMM) for distributed optimization.
The results of our numerical experiments indicate that our method is faster and more robust than widely-used approaches.
arXiv Detail & Related papers (2023-08-31T14:16:30Z) - Optimizing PatchCore for Few/many-shot Anomaly Detection [0.0]
Few-shot anomaly detection (AD) is an emerging sub-field of general AD.
We present a study on the performance of PatchCore, the current state-of-the-art full-shot AD/AS algorithm, in both the few-shot and the many-shot settings.
arXiv Detail & Related papers (2023-07-20T11:45:38Z) - Fixed-Point Automatic Differentiation of Forward--Backward Splitting Algorithms for Partly Smooth Functions [4.389150156866014]
Implicit Differentiation (ID) and Automatic Differentiation (AD) are applied to the fixed-point iterations of proximal splitting algorithms.
We show that AD of the sequence generated by these algorithms converges to the derivative of the solution mapping.
For a variant of automatic differentiation, which we call Fixed-Point Automatic Differentiation (FPAD), we remedy the memory overhead problem of reverse-mode AD (a scalar toy illustration of AD through a fixed-point iteration appears after this list).
arXiv Detail & Related papers (2022-08-05T11:27:55Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Adam revisited: a weighted past gradients perspective [57.54752290924522]
We propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issues.
We prove that WADA can achieve a weighted data-dependent regret bound, which could be better than the original regret bound of ADAGRAD.
arXiv Detail & Related papers (2021-01-01T14:01:52Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate (a toy sketch of this idea appears after this list).
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - Disentangled Representation Learning and Generation with Manifold
Optimization [10.69910379275607]
This work presents a representation learning framework that explicitly promotes disentanglement by encouraging directions of variation.
Our theoretical discussion and various experiments show that the proposed model improves over many VAE variants in terms of both generation quality and disentangled representation learning.
arXiv Detail & Related papers (2020-06-12T10:00:49Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
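The Fixed-Point Automatic Differentiation entry above states that automatic differentiation of the iterates of a fixed-point algorithm converges to the derivative of the solution mapping. The scalar sketch below illustrates that statement on the Babylonian square-root iteration rather than on the proximal splitting algorithms studied in that paper; the function name and the example are assumptions made for illustration.

```python
import math

def fixed_point_ad(a, n_iters=30, x0=1.0):
    """Forward-mode AD through the Babylonian iteration x <- 0.5 * (x + a / x),
    whose fixed point is sqrt(a). The derivative dx/da is propagated alongside x;
    the point illustrated is that it converges to the derivative of the solution
    mapping, d sqrt(a) / da = 1 / (2 * sqrt(a))."""
    x, dx = x0, 0.0                              # seed: d(x0)/da = 0
    for _ in range(n_iters):
        # x_new = 0.5 * (x + a / x); differentiate w.r.t. a using the old x
        dx = 0.5 * (dx + (x - a * dx) / (x * x))
        x = 0.5 * (x + a / x)
    return x, dx

a = 2.0
x_star, dx_ad = fixed_point_ad(a)

# Implicit differentiation at the fixed point x* = T(x*, a):
# dx*/da = dT/da / (1 - dT/dx), with dT/dx = 0.5 * (1 - a / x*^2), dT/da = 0.5 / x*.
dx_implicit = (0.5 / x_star) / (1 - 0.5 * (1 - a / x_star**2))

print(x_star, math.sqrt(a))                        # iterate vs true fixed point
print(dx_ad, dx_implicit, 1 / (2 * math.sqrt(a)))  # AD vs implicit vs analytic derivative
```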
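The MaxVA entry above describes replacing ADAM's running mean of squared gradients with a weighted mean whose weights maximize the estimated per-coordinate variance. The sketch below only illustrates that principle with a hypothetical grid search over candidate second-moment weights; it is not the weighting rule of the cited paper.

```python
import numpy as np

def max_variance_adam_step(w, m, v, g, lr=1e-3, beta1=0.9,
                           beta2_grid=(0.9, 0.99, 0.999), eps=1e-8):
    """ADAM-like update in which, per coordinate, the second-moment weight is
    picked from a small candidate grid so as to maximize the estimated variance
    v - m^2. Bias correction is omitted for brevity."""
    m = beta1 * m + (1 - beta1) * g
    # One candidate second-moment estimate per beta2 value, shape (K, d).
    candidates = np.stack([b * v + (1 - b) * g * g for b in beta2_grid])
    variance_estimate = candidates - m ** 2      # rough per-coordinate variance
    best = np.argmax(variance_estimate, axis=0)  # index of the maximizing weight
    v = candidates[best, np.arange(w.size)]
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v

# Toy usage on a quadratic: minimize 0.5 * ||w||^2 with noisy gradients.
rng = np.random.default_rng(0)
w, m, v = np.ones(5), np.zeros(5), np.zeros(5)
for _ in range(100):
    g = w + 0.1 * rng.normal(size=5)             # gradient of the quadratic plus noise
    w, m, v = max_variance_adam_step(w, m, v, g)
```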