Fugu-MT paper translation (abstract): Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares

Paper abstract: Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares

arxiv url: http://arxiv.org/abs/2510.17506v1
Date: Mon, 20 Oct 2025 13:02:41 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.456501
Title: Convergence Rates for Gradient Descent on the Edge of Stability in Overparametrised Least Squares
Title (reference translation): Convergence rates of gradient descent at the edge of stability in overparametrised least squares
Authors: Lachlan Ewen MacDonald, Hancheng Min, Leandro Palma, Salma Tarmoun, Ziqing Xu, René Vidal
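Abstract summary: Gradient descent on neural networks is frequently run with a large step size, in a regime called the edge of stability. The paper provides convergence rates for gradient descent with large learning rates in an overparametrised least squares setting.
Reference score (independently computed attention score): 33.60489399178793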
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or "stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component is a bifurcating dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: (a) the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time with linear convergence to an orbit of period two centred on the optimally flat global minimum.
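Abstract (reference translation): Classical optimisation theory guarantees that the objective decreases monotonically under gradient descent (GD) in the small step size, or "stable", regime. By contrast, gradient descent on neural networks is often run in a large step size regime called the "edge of stability", where the objective decreases non-monotonically and an implicit bias towards flat minima is observed. As a step toward quantifying this phenomenon, the paper provides convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind the analysis is that, as a result of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which makes it possible to decompose the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the sharpness of the objective, and the orthogonal component is a bifurcating dynamical system. This insight yields convergence rates in three regimes characterised by the size of the learning rate: (a) the subcritical regime, in which transient instability is overcome in finite time and is followed by linear convergence to a suboptimally flat global minimum; (b) the critical regime, in which instability persists for all time, with power-law convergence toward the optimally flat global minimum; and (c) the supercritical regime, in which instability persists for all time, with linear convergence to an orbit of period two centred on the optimally flat global minimum.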
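
Illustration (not from the paper): the three learning-rate regimes named in the abstract can be observed numerically on a toy problem. The sketch below assumes the scalar factorisation objective f(a, b) = 0.5 (ab - y)^2 with y = 1, whose global minimisers form the curve ab = y. On that curve the largest Hessian eigenvalue (the sharpness) is a^2 + b^2, which is smallest, equal to 2y, at a = b. Taking the classical stability threshold 2 / sharpness, the boundary between regimes in this toy problem sits at learning rate 1/y; whether this matches the paper's exact definitions is an assumption. The objective, the function names (gd_trajectory, loss, sharpness_on_manifold), the starting points, and the learning rates are all choices made here for illustration.

```python
"""Toy sketch (not the paper's setting): GD on f(a, b) = 0.5 * (a*b - y)**2."""


def gd_trajectory(a, b, lr, y=1.0, steps=2000):
    """Run plain gradient descent and return the list of iterates (a, b)."""
    path = [(a, b)]
    for _ in range(steps):
        r = a * b - y  # residual; grad f = (r * b, r * a)
        a, b = a - lr * r * b, b - lr * r * a  # simultaneous update
        path.append((a, b))
    return path


def loss(a, b, y=1.0):
    """Least squares objective of the toy problem."""
    return 0.5 * (a * b - y) ** 2


def sharpness_on_manifold(a, b):
    """Largest Hessian eigenvalue of f at a point satisfying a*b == y."""
    return a * a + b * b


# Learning rates below, at, and above the assumed threshold 1/y = 1.
sub = gd_trajectory(2.5, 0.5, lr=0.5)   # starts near a minimiser that is too sharp
crit = gd_trajectory(1.2, 0.9, lr=1.0)
sup = gd_trajectory(1.2, 0.9, lr=1.2)

for name, path in [("subcritical", sub), ("critical", crit), ("supercritical", sup)]:
    (a1, b1), (a2, b2) = path[-2], path[-1]
    print(name, "last two iterates:", (a1, b1), (a2, b2), "final loss:", loss(a2, b2))
```

In this toy problem the imbalance a^2 - b^2 is multiplied by (1 - lr^2 r^2) at every step, where r = ab - y is the residual, so oscillations in the residual push the iterates toward the balanced, flatter part of the minimiser curve. In the subcritical run the loss is expected to fall to numerical zero at a minimiser slightly sharper than the flattest one; in the critical run the loss keeps shrinking slowly; in the supercritical run the iterates settle on an alternating pair of points and the loss does not vanish.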

Related papers