AdaSGD: Bridging the gap between SGD and Adam
- URL: http://arxiv.org/abs/2006.16541v1
- Date: Tue, 30 Jun 2020 05:44:19 GMT
- Title: AdaSGD: Bridging the gap between SGD and Adam
- Authors: Jiaxuan Wang, Jenna Wiens
- Abstract summary: We identify potential contributors to observed differences in performance between SGD and Adam.
We demonstrate how AdaSGD combines the benefits of both SGD and Adam, eliminating the need for approaches that transition from Adam to SGD.
- Score: 14.886598905466604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of stochastic gradient descent (SGD) and adaptive moment
estimation (Adam), researchers have recently proposed optimization techniques
that transition from Adam to SGD with the goal of improving both convergence
and generalization performance. However, precisely how each approach trades off
early progress and generalization is not well understood; thus, it is unclear
when, or even if, one should transition from one approach to the other. In this
work, by first studying the convex setting, we identify potential contributors
to observed differences in performance between SGD and Adam. In particular, we
provide theoretical insights for when and why Adam outperforms SGD and vice
versa. We address the performance gap by adapting a single global learning
rate for SGD, which we refer to as AdaSGD. We justify this proposed approach
with empirical analyses in non-convex settings. On several datasets that span
three different domains, we demonstrate how AdaSGD combines the benefits of both
SGD and Adam, eliminating the need for approaches that transition from Adam to
SGD.
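The abstract's central idea, adapting a single global learning rate for SGD rather than Adam's per-coordinate rates, can be pictured with a minimal sketch. The exact update rule, bias correction, and hyperparameters below are illustrative assumptions, not the authors' definitive AdaSGD algorithm; they only show what a single global adaptive learning rate could look like in code.

import numpy as np

def adasgd_step(w, grad, state, lr=0.1, beta2=0.999, eps=1e-8):
    """One plain-SGD update whose step size is rescaled by a single scalar
    (global) second-moment estimate of the gradient, instead of Adam's
    per-coordinate estimates. Hyperparameters are illustrative assumptions."""
    t = state.get("t", 0) + 1
    v = state.get("v", 0.0)
    # Scalar EMA of the mean squared gradient entry: one number for the whole model.
    v = beta2 * v + (1.0 - beta2) * float(np.mean(grad ** 2))
    v_hat = v / (1.0 - beta2 ** t)  # Adam-style bias correction (assumed here)
    state.update(t=t, v=v)
    # Plain SGD direction, globally rescaled by the scalar adaptive term.
    return w - lr * grad / (np.sqrt(v_hat) + eps)

# Tiny usage example on the quadratic loss 0.5 * ||w||^2, whose gradient is w.
w, state = np.array([1.0, -2.0, 3.0]), {}
for _ in range(500):
    w = adasgd_step(w, w.copy(), state)
print(w)  # the iterate shrinks toward the minimizer at the origin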
Related papers
- A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD [28.905886549938305]
We introduce a novel and comprehensive framework for analyzing the convergence properties of Adam.
We show that Adam attains non-asymptotic sample complexity bounds similar to those of gradient descent.
arXiv Detail & Related papers (2024-10-06T12:15:00Z) - Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling.
Our findings indicate that, except for SGD, these algorithms all perform comparably in their optimal performance.
arXiv Detail & Related papers (2024-07-10T18:11:40Z) - Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on
Transformers, but Sign Descent Might Be [16.170888329408353]
We show that the behavior of Adam with large batches is similar to sign descent with momentum.
We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam.
arXiv Detail & Related papers (2023-04-27T05:41:13Z) - From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching goal of the paper is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation
Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - A Unified Theory of Decentralized SGD with Changing Topology and Local
Updates [70.9701218475002]
We introduce a unified convergence analysis of decentralized communication methods.
We derive universal convergence rates for several applications.
Our proofs rely on weak assumptions.
arXiv Detail & Related papers (2020-03-23T17:49:15Z)