Fugu-MT Paper Translation (Abstract): Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Paper summary: Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

arxiv url: http://arxiv.org/abs/2605.18694v1
Date: Mon, 18 May 2026 17:30:15 GMT
Status: translation complete
System update date: 2026-05-19 17:57:50.201293
Title: Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
Title (reference translation): Can adaptive gradient methods converge under heavy-tailed noise? A case study of AdaGrad
Authors: Zijian Liu
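Abstract summary: This work shows that an adaptive gradient method, $\mathtt{AdaGrad}$, can converge under heavy-tailed noise without any algorithmic changes. It also suggests that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$.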
Reference score (independently computed attention score): 3.8357180714081327
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.
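Abstract (reference translation): Many tasks in modern machine learning have been observed to involve heavy-tailed gradient noise during optimization. To handle this realistic and challenging setting, new mechanisms such as gradient clipping and gradient normalization have been introduced to guarantee the convergence of first-order algorithms. However, adaptive gradient methods, a well-known class of modern optimizers that includes the popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any of these extra operations. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. This work takes a first step toward answering that question by studying a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. It provides the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is obtained without any prior knowledge of $p$ and is therefore adaptive to the tail index. In addition, the work develops an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Finally, it considers $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and shows an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.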
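For readers unfamiliar with the two methods named in the abstract, the following is a minimal sketch of the textbook $\mathtt{AdaGrad}$ update (per-coordinate step sizes) and the $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$ update (one shared step size), run with heavy-tailed gradient noise and no clipping or normalization. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy objective $f(x)=\tfrac12\|x\|^2$, the symmetric Pareto noise model, the function names, and the values of `eta`, `b0`, and `tail_index`. With `tail_index=1.5` the noise has a finite $p$-th moment only for $p<1.5$, so it covers some values of $p$ inside the range $4/3<p\leq2$ discussed above.

```python
import math
import random


def heavy_tailed_noise(rng, tail_index):
    """Symmetric Pareto-type noise (illustrative assumption).

    It has zero mean when tail_index > 1 and a finite p-th moment
    only for p < tail_index.
    """
    magnitude = rng.paretovariate(tail_index) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def noisy_grad(x, rng, tail_index):
    """Stochastic gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return [xi + heavy_tailed_noise(rng, tail_index) for xi in x]


def adagrad(x0, steps, eta=1.0, b0=1e-8, tail_index=1.5, seed=0,
            norm_variant=False):
    """Textbook AdaGrad, or AdaGrad-Norm when norm_variant=True.

    No clipping or normalization is applied to the gradients.
    """
    rng = random.Random(seed)
    x = list(x0)
    acc = [b0 ** 2] * len(x)  # running sum of squared gradients
    for _ in range(steps):
        g = noisy_grad(x, rng, tail_index)
        if norm_variant:
            # AdaGrad-Norm: one accumulator built from the full gradient norm.
            total = acc[0] + sum(gi * gi for gi in g)
            acc = [total] * len(x)
        else:
            # AdaGrad: one accumulator per coordinate.
            acc = [a + gi * gi for a, gi in zip(acc, g)]
        x = [xi - eta * gi / math.sqrt(a) for xi, gi, a in zip(x, g, acc)]
    return x


if __name__ == "__main__":
    start = [5.0, -3.0]
    print("AdaGrad:     ", adagrad(start, steps=2000))
    print("AdaGrad-Norm:", adagrad(start, steps=2000, norm_variant=True))
```

One property visible in the sketch is that each accumulator already contains the current squared gradient, so no single coordinate moves by more than `eta` per step, however large the noise sample is. The sketch makes no claim about convergence rates; those are the subject of the paper.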


Related papers