Fugu-MT Paper Translation (Abstract): SGD at the Edge of Stability: The Stochastic Sharpness Gap

Paper summary: SGD at the Edge of Stability: The Stochastic Sharpness Gap

arxiv url: http://arxiv.org/abs/2604.21016v1
Date: Wed, 22 Apr 2026 19:02:52 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.14044
Title: SGD at the Edge of Stability: The Stochastic Sharpness Gap
Title (reference translation): SGD at the Edge of Stability: The Stochastic Sharpness Gap
Authors: Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis
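Abstract summary: When training networks with full-batch gradient descent (GD) and step size $η$, the largest eigenvalue of the Hessian rises to $2/η$ and hovers there. \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism induced by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbolθ)\leq 2/η$. For mini-batch gradient descent ...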
Reference score (independently calculated attention level): 10.176501817419371
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When training neural networks with full-batch gradient descent (GD) and step size $η$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbolθ)$ -- rises to $2/η$ and hovers there, a phenomenon termed the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $ S(\boldsymbolθ)\leq 2/η$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/η$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/η$. Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $ΔS = ηβσ_{\boldsymbol{u}}^{2}/(4α)$, where $α$ is the progressive sharpening rate, $β$ is the self-stabilization strength, and $σ_{ \boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
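Abstract (reference translation): When training neural networks with full-batch gradient descent (GD) and step size $η$, the sharpness $S(\boldsymbolθ)$, the largest eigenvalue of the Hessian, rises to $2/η$ and hovers there, a phenomenon called the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbolθ)\leq 2/η$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/η$ and the gap widens as the batch size decreases, but no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/η$. Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $ΔS = ηβσ_{\boldsymbol{u}}^{2}/(4α)$, where $α$ is the progressive sharpening rate, $β$ is the self-stabilization strength, and $σ_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.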
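
As a rough illustration of the closed-form gap $ΔS = ηβσ_{\boldsymbol{u}}^{2}/(4α)$ stated in the abstract, the following minimal Python sketch evaluates the formula for several batch sizes. It is not taken from the paper. The function names, the numeric values, and in particular the model for how $σ_{\boldsymbol{u}}^{2}$ depends on batch size (sampling without replacement with a finite-population correction) are assumptions made only for illustration. Reading the equilibrium sharpness as $2/η - ΔS$ is likewise an interpretation of the abstract's wording.

```python
def projected_noise_variance(per_sample_var, batch_size, dataset_size):
    """Mini-batch gradient noise variance along the top eigenvector.

    ASSUMPTION (not from the abstract): batches are drawn without
    replacement, so the variance of the batch mean carries a
    finite-population correction and is zero for a full batch.
    """
    if not 1 <= batch_size <= dataset_size:
        raise ValueError("batch_size must lie in [1, dataset_size]")
    if dataset_size == 1:
        return 0.0
    correction = (dataset_size - batch_size) / (dataset_size - 1)
    return per_sample_var / batch_size * correction


def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Closed-form gap from the abstract: eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)


def equilibrium_sharpness(eta, alpha, beta, sigma_u_sq):
    """ASSUMED reading: the SGD equilibrium sits at 2/eta minus the gap."""
    return 2.0 / eta - sharpness_gap(eta, alpha, beta, sigma_u_sq)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    eta, alpha, beta = 0.01, 1.0, 2.0
    per_sample_var, dataset_size = 4.0, 1024
    for batch_size in (16, 64, 256, 1024):
        var = projected_noise_variance(per_sample_var, batch_size, dataset_size)
        gap = sharpness_gap(eta, alpha, beta, var)
        level = equilibrium_sharpness(eta, alpha, beta, var)
        print(f"B={batch_size:5d}  gap={gap:.6e}  sharpness={level:.6f}")
```

Under these assumptions the gap shrinks as the batch size grows and is exactly zero at the full batch, which matches the abstract's statement that the formula recovers GD when the batch equals the full dataset.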

Related papers