Fugu-MT Paper Translation (Abstract): Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

Paper abstract: Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

arxiv url: http://arxiv.org/abs/2304.13960v1
Date: Thu, 27 Apr 2023 05:41:13 GMT
Status: Translation complete
Last updated in system: 2023-04-28 14:14:32.086325
Title: Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Authors: Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt
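Abstract summary: The behavior of Adam with large batches is shown to be similar to sign descent with momentum. We show that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam.
Reference score (attention score computed independently by this site): 16.170888329408353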
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The success of the Adam optimizer on a wide array of architectures has made it the default in settings where stochastic gradient descent (SGD) performs poorly. However, our theoretical understanding of this discrepancy is lagging, preventing the development of significant improvements on either algorithm. Recent work advances the hypothesis that Adam and other heuristics like gradient clipping outperform SGD on language tasks because the distribution of the error induced by sampling has heavy tails. This suggests that Adam outperforms SGD because it uses a more robust gradient estimate. We evaluate this hypothesis by varying the batch size, up to the entire dataset, to control for stochasticity. We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam. Rather, Adam performs better as the batch size increases, while SGD is less effective at taking advantage of the reduction in noise. This raises the question as to why Adam outperforms SGD in the full-batch setting. Through further investigation of simpler variants of SGD, we find that the behavior of Adam with large batches is similar to sign descent with momentum.
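To make the comparison in the abstract concrete, the following is a minimal sketch of one Adam step next to one step of sign descent with momentum. It is not the authors' code. The abstract does not give the exact form of sign descent with momentum, so this sketch assumes one common variant that takes the sign of an exponential moving average of gradients. The function names, default hyperparameters, and the plain-list representation of parameters are all illustrative assumptions.

```python
import math


def sign(z):
    """Return -1, 0, or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def sign_momentum_step(x, m, grad, lr=0.01, beta=0.9):
    """One step of sign descent with momentum (assumed variant).

    The momentum buffer m is an exponential moving average of gradients,
    and each coordinate moves by exactly lr in the direction -sign(m).
    """
    m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, grad)]
    x = [xi - lr * sign(mi) for xi, mi in zip(x, m)]
    return x, m


def adam_step(x, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of Adam with bias correction; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    new_x = []
    for xi, mi, vi in zip(x, m, v):
        m_hat = mi / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = vi / (1 - beta2 ** t)  # bias-corrected second moment
        new_x.append(xi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_x, m, v


if __name__ == "__main__":
    # With beta1 = beta2 = 0 and eps = 0, Adam's update is g / |g| = sign(g)
    # per coordinate (for nonzero gradient entries), i.e. plain sign descent.
    g = [0.5, -2.0, 3.0]
    x0 = [0.0, 0.0, 0.0]
    zeros = [0.0, 0.0, 0.0]
    xa, _, _ = adam_step(x0, zeros, zeros, g, t=1, lr=0.1,
                         beta1=0.0, beta2=0.0, eps=0.0)
    xs, _ = sign_momentum_step(x0, zeros, g, lr=0.1, beta=0.0)
    print(xa)
    print(xs)
```

In the limiting case shown in the demo, both updates move every coordinate by the same magnitude regardless of the gradient's scale, which is the sense in which Adam can be read as a smoothed form of sign descent. With the usual nonzero `beta1` and `beta2`, the two updates differ, and the abstract's claim concerns their similar behavior with large batches rather than exact equality.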

Related papers

Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling [36.106114687828395]
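In language modeling, Adam is known to substantially outperform Stochastic Gradient Descent (SGD). We thoroughly examine how momentum, gradient clipping, and batch size affect the gap between SGD and Adam.
Paper reference translation (metadata) (2025-06-14T15:37:31Z)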
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
Our experiments confirm that Adam performs worse when the favorable $\ell_\infty$-geometry is changed, whereas SGD remains unaffected.
Paper reference translation (metadata) (2024-10-10T17:58:53Z)
On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions [4.9495085874952895]
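The Adaptive Momentum Estimation (Adam) algorithm is highly effective across a variety of deep learning tasks. We show that, under this general noise model, Adam can find a stationary point with high probability.
Paper reference translation (metadata) (2024-02-06T13:19:26Z)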
Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
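Adam is widely adopted in practical applications because it converges quickly. Existing convergence analyses of Adam rely on a bounded smoothness assumption. This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
Paper reference translation (metadata) (2022-08-21T14:57:47Z)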
Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
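Adam is typically combined with a squared $\ell_2$ regularizer (Adam-$\ell_2$). AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. We show that the advantage of AdamW over Adam-$\ell_2$ correlates with the degree to which the network's gradients are expected to exhibit multiple scales.
Paper reference translation (metadata) (2022-01-31T21:00:55Z)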
Why Does Multi-Epoch Training Help? [62.946840431501855]
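Empirically, SGD with multiple passes over the training data (multi-pass SGD) has been observed to achieve much better excess-risk bounds than SGD with only a single pass over the training data (one-pass SGD). This paper provides a theoretical explanation of why multiple passes over the training data can improve performance under certain circumstances.
Paper reference translation (metadata) (2021-05-13T00:52:25Z)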
Correcting Momentum with Second-order Information [50.992629498861724]
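We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point with optimal complexity in stochastic gradient and Hessian-vector product computations. We validate our results on a variety of large-scale deep learning benchmarks and architectures.
Paper reference translation (metadata) (2021-03-04T19:01:20Z)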
Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
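Adam is a widely used optimization method for deep learning applications. We propose a new method named Adam$^+$ (pronounced Adam-plus). Empirical studies on a range of deep learning tasks, including image classification, language modeling, and automatic speech recognition, show that Adam$^+$ significantly outperforms Adam.
Paper reference translation (metadata) (2020-11-24T09:28:53Z)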
AdaSGD: Bridging the gap between SGD and Adam [14.886598905466604]
We identify potential differences in performance between SGD and Adam. We demonstrate that AdaSGD combines the benefits of both SGD and Adam.
Paper reference translation (metadata) (2020-06-30T05:44:19Z)
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
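Normalization techniques are a boon for modern deep learning. However, it is often overlooked that introducing momentum causes the effective step size for scale-invariant weights to shrink rapidly. This paper verifies that the combination of these two ingredients leads to premature decay of the effective step size and to sub-optimal model performance.
Paper reference translation (metadata) (2020-06-15T08:35:15Z)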

The related papers list is generated automatically from the titles and abstracts of papers on this site.

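The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use (including all information and translations).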