Fugu-MT Paper Translation (Abstract): Implicit Bias of AdamW: $\ell

Paper summary: Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization

arxiv url: http://arxiv.org/abs/2404.04454v1
Date: Fri, 5 Apr 2024 23:56:50 GMT
Status: Translation complete
Last updated in system: 2024-04-09 21:08:32.865530
Title: Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
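Title (reference translation): AdamW: $\ell_\infty$ Norm Constrained Optimization
Authors: Shuo Xie, Zhiyuan Li
Abstract summary: Adam with weight decay (AdamW) is widely acclaimed for its superior performance in language modeling tasks. To understand the benefit of AdamW, we show that it implicitly performs constrained optimization. In the full-batch setting, we show that if AdamW converges with a non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under an $\ell_\infty$ norm constraint on the parameter.
Reference score (attention score computed by this site): 5.896194021915813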
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
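The connection to Frank-Wolfe mentioned in the abstract can be made concrete in the simplest case. What follows is a sketch rather than the paper's proof: it assumes plain sign descent in place of Adam, and the symbols $L$ (loss), $x_t$ (iterate), $\eta_t$ (learning rate) and $\lambda$ (weight decay factor) are notation chosen here for illustration.

```latex
\begin{align*}
x_{t+1} &= x_t - \eta_t \bigl( \operatorname{sign}(\nabla L(x_t)) + \lambda x_t \bigr) \\
        &= (1 - \eta_t \lambda)\, x_t + \eta_t \lambda\, s_t,
        \qquad s_t := -\tfrac{1}{\lambda} \operatorname{sign}(\nabla L(x_t)), \\
s_t &\in \operatorname*{arg\,min}_{\|s\|_\infty \le 1/\lambda} \langle \nabla L(x_t),\, s \rangle .
\end{align*}
```

Under these assumptions each update is a Frank-Wolfe step with step size $\eta_t \lambda$ over the ball $\{x : \|x\|_\infty \le 1/\lambda\}$. In particular, if $\eta_t \lambda \le 1$ and $\|x_t\|_\infty \le 1/\lambda$, then $\|x_{t+1}\|_\infty \le 1/\lambda$ as well, since $x_{t+1}$ is a convex combination of two points in the ball.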
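Abstract (reference translation): Adam with decoupled weight decay (AdamW) is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in both generalization and optimization. However, this advantage is not well understood theoretically. Intuitively, Adam with $\ell_2$ regularization optimizes the $\ell_2$-regularized loss, but it is unclear whether AdamW optimizes any specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm, and on a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.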
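The abstract's claim that the $\ell_\infty$ norm of the parameter stays bounded by the inverse of the weight decay factor can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' code: it uses plain sign descent with decoupled weight decay as a stand-in for AdamW, a hypothetical quadratic loss, and arbitrary values for the learning rate and weight decay. All function names are made up for this example.

```python
def sign(v):
    """Elementwise sign, with sign(0) = 0."""
    return [(x > 0) - (x < 0) for x in v]


def grad(x, target):
    """Gradient of the toy loss L(x) = 0.5 * sum((x_i - target_i)^2)."""
    return [xi - ti for xi, ti in zip(x, target)]


def sign_descent_wd_step(x, g, lr, wd):
    """Sign descent with decoupled weight decay: x - lr * (sign(g) + wd * x)."""
    return [xi - lr * (si + wd * xi) for xi, si in zip(x, sign(g))]


def frank_wolfe_step(x, g, gamma, radius):
    """Frank-Wolfe step over the l_inf ball {s : max_i |s_i| <= radius}."""
    # Linear minimization oracle: -radius * sign(g) minimizes <g, s> over the ball.
    s = [-radius * si for si in sign(g)]
    return [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, s)]


def run(steps=200, lr=0.05, wd=0.5):
    # Assumed toy setup: the unconstrained minimizer has a coordinate (5.0)
    # outside the l_inf ball of radius 1 / wd = 2.0.
    target = [5.0, -0.3, 1.0]
    radius = 1.0 / wd
    x = [0.0, 0.0, 0.0]
    max_gap = 0.0   # largest difference between the two update rules
    max_norm = 0.0  # largest l_inf norm seen along the trajectory
    for _ in range(steps):
        g = grad(x, target)
        x_sd = sign_descent_wd_step(x, g, lr, wd)
        x_fw = frank_wolfe_step(x, g, lr * wd, radius)
        max_gap = max(max_gap, max(abs(a - b) for a, b in zip(x_sd, x_fw)))
        x = x_sd
        max_norm = max(max_norm, max(abs(v) for v in x))
    return x, max_gap, max_norm


if __name__ == "__main__":
    x, max_gap, max_norm = run()
    print("final iterate:", x)
    print("max gap between the two steps:", max_gap)
    print("max l_inf norm:", max_norm, "radius:", 1.0 / 0.5)
```

In this setup the two update rules agree up to floating-point rounding at every step, the iterates never leave the ball of radius 1/wd, and the first coordinate approaches the boundary value 2.0 instead of the unconstrained minimizer 5.0. Adam's moving averages are deliberately left out, so this illustrates only the SignGD limit described in the abstract.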

Related papers

Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
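We show that Adam achieves the optimal rate of $\mathcal{O}(\frac{1}{T^{1/4}})$, rather than the previous $\mathcal{O}(\frac{\ln T}{T^{1/4}})$. Our theoretical analysis offers new insight into the role of momentum as a key factor in guaranteeing convergence.
Paper reference translation (metadata) (2025-07-08T13:19:26Z)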
Convergence Rate Analysis of LION [54.28350823319057]
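LION converges at a rate of $\mathcal{O}(\sqrt{d}\,K^{-\ldots})$, measured by the gradient Karush-Kuhn-Tucker (KKT) criterion. We show that, compared with standard SGD, LION attains lower loss and higher performance.
Paper reference translation (metadata) (2024-11-12T11:30:53Z)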
ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate [21.378608502899077]
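This paper proposes a new adaptive gradient method named ADOPT, which attains the optimal convergence rate of $\mathcal{O}(\ldots)$ without relying on the bounded noise assumption. ADOPT achieves superior results compared with Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning.
Paper reference translation (metadata) (2024-11-05T06:57:47Z)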
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
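Adam depends on a favorable $\ell_\infty$-geometry of the loss, whereas SGD is unaffected by it. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD remains unaffected.
Paper reference translation (metadata) (2024-10-10T17:58:53Z)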
Decoupled Weight Decay for Any $p$ Norm [1.1510009152620668]
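We consider a simple yet effective approach to sparsification based on bridge ($L_p$) regularization during training. We introduce a novel weight decay scheme that generalizes standard $L_2$ weight decay to any $p$ norm. We demonstrate empirically that it leads to highly sparse networks while maintaining performance comparable to standard $L_2$ regularization.
Paper reference translation (metadata) (2024-04-16T18:02:15Z)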
Closing the Gap Between the Upper Bound and the Lower Bound of Adam's Iteration Complexity [51.96093077151991]
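We derive a new convergence guarantee for Adam under only an $L$-smooth condition and a bounded noise variance assumption. Our proof uses novel techniques to handle the entanglement between momentum and the adaptive learning rate.
Paper reference translation (metadata) (2023-10-27T09:16:58Z)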
Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
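Adam is widely adopted in practical applications because of its fast convergence. Existing convergence analyses of Adam rely on the bounded smoothness assumption. In this paper, we study the convergence of randomly reshuffled Adam with a diminishing learning rate.
Paper reference translation (metadata) (2022-08-21T14:57:47Z)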
Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
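Adam is commonly used together with an $\ell_2$ regularizer (Adam-$\ell_2$). AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. We show that the advantage of AdamW over Adam-$\ell_2$ correlates with the degree to which the gradients of the network are expected to exhibit multiple scales.
Paper reference translation (metadata) (2022-01-31T21:00:55Z)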
Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
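Adam is an optimization method widely used in deep learning applications. We propose a new method named Adam$^+$ (pronounced Adam-plus). Empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, show that Adam$^+$ significantly outperforms Adam.
Paper reference translation (metadata) (2020-11-24T09:28:53Z)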
A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
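We present a proof of convergence covering both the Adam and Adagrad algorithms. With suitable hyperparameters, Adam converges at the same rate, $O(d\ln(N)/\sqrt{N})$; with the default parameters, it does not converge.
Paper reference translation (metadata) (2020-03-05T01:56:17Z)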

The related papers list is generated automatically from the titles and abstracts of papers on this site.
