Large Batch Analysis for Adagrad Under Anisotropic Smoothness
- URL: http://arxiv.org/abs/2406.15244v1
- Date: Fri, 21 Jun 2024 15:29:31 GMT
- Title: Large Batch Analysis for Adagrad Under Anisotropic Smoothness
- Authors: Yuxing Liu, Rui Pan, Tong Zhang,
- Abstract summary: Adaptive algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models.
Despite their huge success in practice, their theoretical advantages over gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice.
This is because the only theoretical result that can demonstrate the benefit of Adagrad over SGD in the original paper of Adagrad.
We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods.
- Score: 10.995979046710893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive gradient algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate the benefit of Adagrad over SGD was obtained in the original paper of Adagrad for nonsmooth objective functions. However, for nonsmooth objective functions, there can be a linear slowdown of convergence when batch size increases, and thus a convergence analysis based on nonsmooth assumption cannot be used for large batch algorithms. In this work, we resolve this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives suitable for the large batch setting. It is shown that under the anisotropic smoothness and noise conditions, increased batch size does not slow down convergence for Adagrad, and thus it can still achieve a faster convergence guarantee over SGD even in the large batch setting. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our theoretical analysis.
Related papers
- Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT)
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling [0.6906005491572401]
The paper provides theoretical insights on why large batch sizes fall into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates, and what the optimal learning rate is.
arXiv Detail & Related papers (2023-11-15T07:27:40Z) - Enhancing Generalization of Universal Adversarial Perturbation through
Gradient Aggregation [40.18851174642427]
Deep neural networks are vulnerable to universal adversarial perturbation (UAP)
In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective.
We propose a simple and effective method called Gradient Aggregation (SGA)
SGA alleviates the gradient vanishing and escapes from poor local optima at the same time.
arXiv Detail & Related papers (2023-08-11T08:44:58Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular reinforcement learning (RL) problems.
"vanilla" theoretical understanding of PG trajectory is one of the most popular methods for solving RL problems.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - A High Probability Analysis of Adaptive SGD with Momentum [22.9530287983179]
Gradient Descent (DSG) and its variants are the most used algorithms in machine learning applications.
We show for the first time the probability of the gradients to zero in smooth non setting for DelayedGrad with momentum.
arXiv Detail & Related papers (2020-07-28T15:06:22Z) - Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite
Epochs [25.158203665218164]
We show that adaptive gradient methods can be faster than random shuffling SGD after finite time.
To the best of our knowledge, it is the first to demonstrate that adaptive gradient methods can be faster than SGD after finite time.
arXiv Detail & Related papers (2020-06-12T09:39:47Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.