Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
- URL: http://arxiv.org/abs/2412.19444v1
- Date: Fri, 27 Dec 2024 04:22:02 GMT
- Title: Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
- Authors: Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu
- Abstract summary: We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees.
We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
- Score: 56.060918447252625
- Abstract: Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of AdaGrad++ and Adam++.
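For reference, the abstract contrasts AdaGrad++ with standard AdaGrad, whose learning rate must be tuned by hand. Below is a minimal sketch of the textbook AdaGrad update — not the paper's AdaGrad++; the `lr` and `eps` values are illustrative defaults, and `lr` is exactly the hyperparameter the parameter-free variants aim to eliminate:

```python
import math

def adagrad_step(x, g, v, lr=0.1, eps=1e-8):
    """One textbook AdaGrad step on lists of parameters.

    x: parameters, g: gradients, v: running sums of squared gradients.
    Each coordinate's effective step size shrinks as its accumulated
    squared gradient grows. Returns the updated (x, v).
    """
    v = [vi + gi * gi for vi, gi in zip(v, g)]
    x = [xi - lr * gi / (math.sqrt(vi) + eps)
         for xi, gi, vi in zip(x, g, v)]
    return x, v

# Toy usage: minimize f(x) = x^2 starting from x = 1.0
x, v = [1.0], [0.0]
for _ in range(200):
    g = [2.0 * x[0]]  # gradient of x^2
    x, v = adagrad_step(x, g, v)
```

Even on this toy problem, convergence speed depends on the choice of `lr`; the paper's contribution is removing that dependence while retaining AdaGrad-style convergence rates.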
Related papers
- Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed [83.8485684139678]
Methods with adaptive steps, such as AdaGrad and Adam, are essential for training modern Deep Learning models.
We show that AdaGrad can have bad high-probability convergence if the noise is heavy-tailed.
We propose a new version of AdaGrad called Clip-RAdaGradD (Clipped Reweighted AdaGrad with Delay).
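The building block this line of work relies on is standard gradient clipping by L2 norm; a minimal sketch of that primitive (the paper's method additionally adds reweighting and delay, which this sketch does not reproduce):

```python
import math

def clip_by_norm(g, max_norm):
    """Scale gradient g (a list of floats) so its L2 norm is at most
    max_norm. Clipping bounds the impact of rare, very large gradients,
    which is the standard remedy for heavy-tailed gradient noise."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm <= max_norm:
        return list(g)
    scale = max_norm / norm
    return [gi * scale for gi in g]

# A gradient of norm 5 is rescaled to norm 1; smaller ones pass through.
clipped = clip_by_norm([3.0, 4.0], 1.0)
```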
arXiv Detail & Related papers (2024-06-06T18:49:10Z) - Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad [16.249992982986956]
This paper introduces a novel scale-invariant adaptive algorithm named KATE, which performs consistently well on complex machine learning tasks.
We compare KATE to other state-of-the-art adaptive algorithms, Adam and AdaGrad, in numerical experiments on different problems.
arXiv Detail & Related papers (2024-03-05T04:35:59Z) - StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling [0.0]
We introduce StochGradAdam, a novel extension of the Adam algorithm, incorporating gradient sampling techniques.
StochGradAdam achieves comparable or superior performance to Adam, even when using fewer gradient updates per iteration.
The results suggest that this approach is particularly effective for large-scale models and datasets.
arXiv Detail & Related papers (2023-10-25T22:45:31Z) - Learning-Rate-Free Learning by D-Adaptation [18.853820404058983]
D-Adaptation is an approach to automatically setting the learning rate which achieves the optimal rate of convergence for convex Lipschitz functions.
We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems.
arXiv Detail & Related papers (2023-01-18T19:00:50Z) - A Control Theoretic Framework for Adaptive Gradient Optimizers in
Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
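A pairwise loss can be illustrated with a hinge-style margin on score differences, as used in AUC maximization and ranking; the sketch below is a generic pairwise OGD step, not the specific algorithms of the cited paper:

```python
def ogd_pairwise_step(w, x_pos, x_neg, lr=0.1):
    """One online gradient descent step on the pairwise hinge loss
    max(0, 1 - w.(x_pos - x_neg)), which depends on a *pair* of
    instances rather than a single example. Returns the updated w."""
    diff = [p - n for p, n in zip(x_pos, x_neg)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    if margin < 1.0:  # margin violated: subgradient is -diff
        w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# One step on a violated pair pushes w toward separating the pair.
w = ogd_pairwise_step([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], lr=0.1)
```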
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective [0.0]
We propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for solving non-convex machine learning problems.
Specifically, we adopt a state-space perspective for analyzing the convergence of gradient acceleration algorithms, namely G-AdaGrad and Adam.
arXiv Detail & Related papers (2021-05-31T20:30:25Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
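The alignment idea behind AdaRem can be illustrated with a generic Rprop-style sign-agreement rule; this is a hypothetical sketch with made-up factors (`up`, `down`), not the paper's actual update, which differs in detail:

```python
def sign_aligned_lr(lrs, prev_updates, grads, up=1.1, down=0.9):
    """Per-parameter learning-rate adjustment: grow a parameter's lr
    when its previous update direction agrees with the current descent
    direction (-gradient), shrink it otherwise. Returns the new lrs."""
    new_lrs = []
    for lr, du, g in zip(lrs, prev_updates, grads):
        aligned = du * (-g) > 0  # past update points along -gradient
        new_lrs.append(lr * (up if aligned else down))
    return new_lrs

# Past update -0.01 agrees with descent direction for gradient +1.0,
# so that coordinate's learning rate grows.
grown = sign_aligned_lr([0.1], [-0.01], [1.0])
```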
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.