Fugu-MT Paper Translation (Abstract): On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

Paper summary: On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

arxiv url: http://arxiv.org/abs/2603.09952v1
Date: Tue, 10 Mar 2026 17:49:19 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.514433
Title: On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Title (reference translation): On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Authors: Ruihan Xu, Jiajin Li, Yiping Lu
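Abstract summary: A family of mean-normalized operator norms admits layerwise composability and yields width-independent smoothness bounds. The paper also shows that \textrm{Muon} can suffer $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers achieves width-independent bounds.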
Reference score (independently computed attention score): 10.976013033990448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $\pmean \to \qmean$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $μ$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
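Abstract (reference translation): A central question in modern deep learning is how to design optimizers whose behavior stays stable as the network width $w$ grows. This work addresses the question by interpreting widely used neural-network optimizers, such as \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This viewpoint connects optimizer geometry to the Lipschitz structure of the network forward map and allows width-independent control of the Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot give width-independent bounds in deep architectures. The authors overcome this limitation by introducing a family of mean-normalized operator norms, written $\pmean \to \qmean$, which admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $μ$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. They further show that \textrm{Muon} can suffer $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a newly proposed family of row-normalized optimizers achieves width-independent smoothness guarantees. Building on these observations, they propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.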
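Illustrative sketch: the abstract names row normalization and column normalization as the building blocks of the proposed optimizers but does not spell out the update rule. The snippet below is only a rough illustration of what normalizing a gradient matrix row-wise or column-wise can look like in a plain gradient step. The function names (`row_normalize`, `column_normalize`, `normalized_step`), the use of the Euclidean norm, and the `eps` guard are all assumptions made for this example. The width-dependent learning-rate scaling described in the abstract is deliberately left out because its exact form is not given here. This is not MOGA as defined by the authors.

```python
import math


def row_normalize(matrix, eps=1e-12):
    """Rescale each row of a matrix (list of lists) to roughly unit Euclidean norm.

    Assumption: the Euclidean norm is used per row; the paper may use another norm.
    """
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out


def column_normalize(matrix, eps=1e-12):
    """Rescale each column to roughly unit Euclidean norm (transpose, normalize rows, transpose back)."""
    transposed = [list(col) for col in zip(*matrix)]
    normalized = row_normalize(transposed, eps)
    return [list(row) for row in zip(*normalized)]


def normalized_step(weights, grad, lr, normalize=row_normalize):
    """One gradient step using the normalized gradient as the update direction.

    Hypothetical helper: no momentum, weight decay, or width-aware scaling is applied.
    """
    direction = normalize(grad)
    return [
        [w - lr * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, direction)
    ]


grad = [[3.0, 4.0], [0.0, 2.0]]
weights = [[1.0, 1.0], [1.0, 1.0]]
row_dir = row_normalize(grad)
col_dir = column_normalize(grad)
new_weights = normalized_step(weights, grad, lr=0.1)
```

After normalization every nonzero row (or column) of the update has the same norm regardless of the raw gradient scale, which is the property that makes the step size easier to reason about as the matrix grows.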


Related papers