Fugu-MT Paper Translation (Abstract): Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Paper abstract: Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

arxiv url: http://arxiv.org/abs/2605.18528v1
Date: Mon, 18 May 2026 15:13:18 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.897173
Title: Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
Title (reference translation): Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
Authors: Jiayu Zhang, Tianyi Lin
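Abstract summary: We show that any scale-invariant first-order method with the spectral norm requires $Ω(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that, when the norm is spectral and the Hessian is Lipschitz, a transported Scion method improves the bound to $O(\min\{m, n\}ε^{-\frac{5p-3}{2p-2}})$.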
Reference score (independently computed attention score): 12.977441534320041
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to achieve an $ε$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with spectral norm requires $Ω(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of $O(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}ε^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.
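Abstract (reference translation): A lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods matter because they not only support hyperparameter transfer across model sizes but also exploit input-output matrix norm geometry. At the same time, stochastic gradient noise in deep learning is often far from sub-Gaussian and can be heavy-tailed. These key observations have shaped the algorithmic principles used to train neural networks, yet their joint theoretical consequences remain unexplored. In particular, it is unclear which dimension dependence is unavoidable for scale-invariant methods with a general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to reach an $ε$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with the spectral norm requires $Ω(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with the spectral norm achieves the matching upper bound of $O(\min\{m, n\}ε^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}ε^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into the transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.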
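
To make the two rates in the abstract easier to compare, the following is a minimal sketch that evaluates the exponents of $ε^{-1}$ in the two stated bounds, $\frac{3p-2}{p-1}$ and $\frac{5p-3}{2p-2}$, for a given moment order $p$. The function names are hypothetical and chosen only for illustration, and the check $p > 1$ is an assumption made so that both fractions are well defined; the abstract itself does not state the admissible range of $p$.

```python
from fractions import Fraction


def first_order_exponent(p):
    """Exponent a in eps^{-a} for the bound min{m, n} * eps^{-(3p-2)/(p-1)}."""
    p = Fraction(p)
    if p <= 1:
        # Assumption: p > 1 so that the denominator p - 1 is positive.
        raise ValueError("p must be greater than 1")
    return (3 * p - 2) / (p - 1)


def lipschitz_hessian_exponent(p):
    """Exponent a in eps^{-a} for the bound min{m, n} * eps^{-(5p-3)/(2p-2)}."""
    p = Fraction(p)
    if p <= 1:
        # Assumption: p > 1 so that the denominator 2p - 2 is positive.
        raise ValueError("p must be greater than 1")
    return (5 * p - 3) / (2 * p - 2)


if __name__ == "__main__":
    # Exact rational arithmetic avoids rounding noise when comparing exponents.
    for p in (Fraction(3, 2), Fraction(2)):
        a1 = first_order_exponent(p)
        a2 = lipschitz_hessian_exponent(p)
        print(f"p = {p}: first-order {a1}, Lipschitz Hessian {a2}, gap {a1 - a2}")
```

Algebraically, $\frac{3p-2}{p-1} - \frac{5p-3}{2p-2} = \frac{(6p-4)-(5p-3)}{2(p-1)} = \frac{1}{2}$, so under the stated formulas the Lipschitz-Hessian bound saves a factor of $ε^{-1/2}$ for every $p > 1$. At $p = 2$ the two exponents are $4$ and $\frac{7}{2}$.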

Related papers