Fugu-MT paper translation (abstract): From SGD to Muon: Adaptive Optimization via Schatten-p Norms

Paper summary: From SGD to Muon: Adaptive Optimization via Schatten-p Norms

arxiv url: http://arxiv.org/abs/2605.19781v1
Date: Tue, 19 May 2026 12:47:41 GMT
Status: Translation complete
Last updated in system: 2026-05-21 01:01:02.817982
Title: From SGD to Muon: Adaptive Optimization via Schatten-p Norms
Title (reference translation): From SGD to Muon: Adaptive Optimization via Schatten-p Norms
Authors: Thomas Massena, Corentin Friedrich, Mathieu Serrurier
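Abstract summary: Modern optimizers such as Muon impose matrix-wise geometry constraints on their updates. All current methods impose fixed LMO geometries on their update rules. This paper proposes a novel, efficient data-driven criterion for dynamically choosing proxy-optimal update LMO geometries.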
Reference score (independently computed attention level): 3.5975968496682484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by design or empirically, which are not necessarily optimal according to the problem's geometry. We introduce a novel, efficient data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the better-performing of Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.
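Abstract (reference translation): Modern optimizers such as Muon impose matrix-wise geometric constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose a fixed LMO geometry on the update rule, chosen by design or empirically, and this choice is not necessarily optimal for the geometry of the problem. This paper proposes a novel, efficient data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual deep neural network layers. The criterion is derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, and it navigates a design space that interpolates from SGD to Muon updates. In addition, integrating parameter-wise preconditioning lets the framework recover SGD, Muon, Adam, and MuAdam as specific extrema. To make the adaptive approach scalable, it is paired with efficient computational strategies, which keeps the runtime overhead to only $\sim$3% on highly optimized baselines. As a proof of concept, the authors show that this data-driven optimizer beats or remains competitive with the better-performing of Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be adapted successfully and efficiently from runtime data, opening a new pathway for optimizer design beyond static geometries.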
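
To make the "interpolating from SGD to Muon" idea concrete, here is a minimal NumPy sketch of the standard closed-form LMO over a unit Schatten-p ball. It is not the paper's implementation: the function name `schatten_lmo` is hypothetical, the exponent `p` is supplied by hand, and the paper's data-driven criterion for choosing the geometry, its preconditioning, and its efficient computational strategies are not reproduced. With the SVD $G = U \operatorname{diag}(s) V^\top$ and $1/p + 1/q = 1$, the minimizer of $\langle G, X \rangle$ over the ball is $-U \operatorname{diag}(t) V^\top$ with $t_i \propto s_i^{q-1}$ and $\lVert t \rVert_p = 1$.

```python
import numpy as np


def schatten_lmo(grad, p):
    """Minimize <grad, X> over the unit Schatten-p ball, for p > 1 or p = inf.

    Hypothetical helper for illustration only; the exponent p is an input
    here, whereas the paper selects the geometry from runtime statistics.
    """
    if not (p > 1):
        raise ValueError("this sketch only covers p > 1 (including p = inf)")
    # Full SVD for clarity; assumed too slow for real training loops.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if s.max() == 0.0:
        return np.zeros_like(grad)
    if np.isinf(p):
        # Dual exponent q = 1: every singular value is replaced by 1.
        # (Zero singular values are also mapped to 1 in this sketch.)
        t = np.ones_like(s)
    else:
        q = p / (p - 1.0)            # dual exponent, 1/p + 1/q = 1
        t = (s / s.max()) ** (q - 1.0)  # rescaled for numerical stability
        t = t / np.linalg.norm(t, ord=p)  # enforce unit Schatten-p norm
    # Scale the columns of u by t, then recombine with vt.
    return -(u * t) @ vt
```

At `p = 2` the result is the Frobenius-normalized negative gradient (a normalized SGD direction), and at `p = np.inf` it is the orthogonalized direction `-U @ Vt` associated with Muon. Intermediate values of `p` flatten the singular value spectrum progressively.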

Related papers