Fugu-MT Paper Translation (Summary): Understanding Gradient Descent on Edge of Stability in Deep Learning

Paper summary: Understanding Gradient Descent on Edge of Stability in Deep Learning

arxiv url: http://arxiv.org/abs/2205.09745v1
Date: Thu, 19 May 2022 17:57:01 GMT
Status: Translation complete
Last updated in system: 2022-05-20 14:48:06.530031
Title: Understanding Gradient Descent on Edge of Stability in Deep Learning
Title (reference translation): Understanding Gradient Descent at the Edge of Stability in Deep Learning
Authors: Sanjeev Arora, Zhiyuan Li, Abhishek Panigrahi
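Abstract summary: This paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase and shows that GD updates, due to the non-smooth loss landscape, evolve along a deterministic flow on the manifold of minimum loss. These theoretical results are corroborated by experiments.
Reference score (attention measure computed by this site): 32.03036040349019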
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function $L$ with certain regularity condition, this effect is demonstrated for (1) Normalized GD, i.e., GD with a varying LR $\eta_t = \frac{\eta}{\|\nabla L(x(t))\|}$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{\max}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
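The threshold $2/$LR mentioned in the abstract can be motivated by a standard one-dimensional calculation. The quadratic loss below is an illustrative assumption and is not taken from the paper; $\lambda$ plays the role of the sharpness and $\eta$ the role of the LR.

```latex
% Assumed toy loss: L(x) = (\lambda/2) x^2 with sharpness \lambda > 0.
\begin{align*}
x_{t+1} &= x_t - \eta \, L'(x_t) = x_t - \eta \lambda x_t = (1 - \eta\lambda)\, x_t, \\
|x_{t+1}| < |x_t| &\iff |1 - \eta\lambda| < 1 \iff 0 < \lambda < \frac{2}{\eta}.
\end{align*}
% At \lambda = 2/\eta the iterate flips sign with constant magnitude;
% for \lambda > 2/\eta the iterates grow in magnitude.
```

On this toy problem, GD with a fixed LR contracts only while the sharpness stays below $2/\eta$, which is why $2/$LR is the natural stability boundary.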
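Abstract (reference translation): Deep learning experiments with deterministic Gradient Descent (GD) in Cohen et al. (2021) revealed an Edge of Stability (EoS) phase, in which learning rate (LR) and sharpness (i.e., the largest eigenvalue of the Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR, and the loss goes up and down across iterations while keeping an overall downward trend. This paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase and shows that GD updates evolve along a deterministic flow on the manifold of minimum loss. This contrasts with many previous results on implicit bias, which rely on infinitesimal updates or on noise in the gradient. Formally, for any smooth function $L$ satisfying a certain regularity condition, the effect is shown for (1) Normalized GD, i.e., GD with a varying LR $\eta_t = \frac{\eta}{\|\nabla L(x(t))\|}$ and loss $L$; and (2) GD with constant LR and loss $\sqrt{L}$. Both reach the Edge of Stability, and the associated flow on the manifold minimizes $\lambda_{\max}(\nabla^2 L)$. These theoretical results are corroborated by an experimental study.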
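The two update rules named in the abstract can be written down in a few lines. The following is a minimal sketch, not the authors' code: the two-parameter toy loss `loss(x, y) = 0.5 * (x*y - 1)^2`, whose minimizers form the curve `x*y = 1`, and all function names are assumptions chosen for illustration.

```python
import math

def loss(x, y):
    # Assumed toy loss (not from the paper); it is zero on the curve x*y = 1.
    return 0.5 * (x * y - 1.0) ** 2

def grad(x, y):
    # Gradient of the toy loss.
    r = x * y - 1.0
    return r * y, r * x

def sharpness(x, y):
    # Largest eigenvalue of the 2x2 Hessian [[a, b], [b, c]] of the toy loss.
    a = y * y
    c = x * x
    b = 2.0 * x * y - 1.0
    return 0.5 * (a + c) + math.sqrt((0.5 * (a - c)) ** 2 + b * b)

def normalized_gd_step(x, y, eta):
    # Normalized GD: LR eta_t = eta / ||grad L||, so every step has length eta.
    gx, gy = grad(x, y)
    norm = math.hypot(gx, gy)
    if norm == 0.0:
        return x, y  # the normalized step is undefined at a stationary point
    lr = eta / norm
    return x - lr * gx, y - lr * gy

def sqrt_loss_gd_step(x, y, eta):
    # GD with constant LR on sqrt(L): grad sqrt(L) = grad L / (2 * sqrt(L)).
    value = loss(x, y)
    if value == 0.0:
        return x, y  # sqrt(L) is not differentiable where L = 0
    gx, gy = grad(x, y)
    scale = 1.0 / (2.0 * math.sqrt(value))
    return x - eta * scale * gx, y - eta * scale * gy

def run(step, x, y, eta, num_steps):
    # Apply `step` repeatedly, recording (loss, sharpness) before each update.
    history = []
    for _ in range(num_steps):
        history.append((loss(x, y), sharpness(x, y)))
        x, y = step(x, y, eta)
    return (x, y), history
```

Calling, for example, `run(normalized_gd_step, 2.0, 1.0, 0.01, 1000)` and inspecting the recorded sharpness values is one way to explore the behavior described in the abstract on this toy problem; no particular outcome is asserted here, and the toy loss is not claimed to satisfy the paper's regularity condition.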

Related papers

Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD [0.0]
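We show that mini-batch stochastic gradient descent (SGD) trains in a different regime, which we call the Edge of Stochastic Stability (EoSS). What stabilizes at $2/\eta$ is *Batch Sharpness*. We further discuss mathematical modeling of SGD trajectories.
Paper reference translation (metadata) (2024-12-29T18:59:01Z)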
Convergence Rate Analysis of LION [54.28350823319057]
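LION converges to Karush-Kuhn-Tucker (KKT) points at a rate $\mathcal{O}(\sqrt{d}K^{-\cdots})$, measured by the gradient. Compared with standard SGD, LION is shown to achieve lower loss and higher performance.
Paper reference translation (metadata) (2024-11-12T11:30:53Z)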
Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
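We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods. In particular, we derive improved convergence rates for Gradient Descent with smoothed gradient clipping and for Gradient Descent with Polyak stepsizes.
Paper reference translation (metadata) (2024-09-23T13:11:37Z)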
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
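We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data. We show that GD exits an initial oscillatory phase rapidly, in $\mathcal{O}(\eta)$ steps, and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
Paper reference translation (metadata) (2024-02-24T23:10:28Z)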
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
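We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with the environment and seeks to reach a specific goal state while minimizing the cumulative cost. Existing works often assume a strict lower bound on the per-iteration cost function or an upper bound on the expected length of the optimal policy.
Paper reference translation (metadata) (2024-02-14T07:52:00Z)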
Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex Optimization [9.019243171993553]
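The number of training steps $T$ and the step-size $\eta$ can affect generalization in smooth stochastic convex optimization (SCO) problems. We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD). Recent works show that better rates can be attained, but the improvement shrinks when the training time is long.
Paper reference translation (metadata) (2023-03-19T20:24:33Z)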
Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
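Gradient clipping is a technique used in deep learning applications such as large-scale language modeling. Recent experimental studies of training report a rather special behavior, which eases the order of the complexity.
Paper reference translation (metadata) (2023-03-02T00:57:38Z)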
Generalization Bounds for Gradient Methods via Discrete and Continuous Prior [8.76346911214414]
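We present new high-probability generalization bounds of order $O\left(\frac{1}{n} + \frac{L^2}{n}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2\right)$. We also obtain new bounds for certain variants of SGD.
Paper reference translation (metadata) (2022-05-27T07:23:01Z)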
Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
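Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems. This paper presents two variants of this kind, namely variance-reduced Langevin dynamics and recursive gradient Langevin dynamics.
Paper reference translation (metadata) (2022-03-30T11:39:00Z)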
Black-Box Generalization [31.80268332522017]
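We carry out the first generalization error analysis for black-box learning. We show that both generalization bounds are independent of $d$ and $K$ and hold, under appropriate choices, with a slightly decreased learning rate.
Paper reference translation (metadata) (2022-02-14T17:14:48Z)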
What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning. This paper provides a general framework for such analysis by adapting ideas from Katzenberger (1991). The framework gives (1) a global analysis of the implicit bias for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020), which covers only $\eta^{-1.6}$ steps, and (2) allows any noise covariance.
Paper reference translation (metadata) (2021-10-13T17:50:46Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
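This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD converges along the small-eigenvalue directions.
Paper reference translation (metadata) (2020-11-04T21:07:52Z)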

The related papers list is generated automatically from the titles and abstracts of the papers on this site.
