Fugu-MT Paper Translation (Abstract): Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Paper abstract: Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

arxiv url: http://arxiv.org/abs/2605.18106v1
Date: Mon, 18 May 2026 09:17:26 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.22042
Title: Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
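Title (reference translation): A Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Authors: Tim Tsz-Kit Lau, Weijie Su
Abstract summary: A striking geometric disparity has long persisted in the practice of deep learning. The gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block.
Reference score (independently computed attention metric): 3.433766572511366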
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
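Abstract (reference translation): In the practice of deep learning, a striking geometric disparity has persisted for a long time. Modern neural network architectures naturally exhibit rich symmetry and equivariance, yet common optimizers such as Adam and its variants act essentially coordinate by coordinate and cannot respect the equivariance structure of the parameter space. To address this mismatch, we introduce a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first give a unified view of bi-orthogonally equivariant updates for general matrix layers, as used in stochastic spectral descent, Muon, Scion, and polar gradient methods. Furthermore, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers, namely embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They produce an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We support this principle with pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, compared with the corresponding AdamW updates.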
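
Illustrative sketch (not from the paper): the abstract states that an update rule should be equivariant under the symmetry group acting on its weight block, and lists "row-norm" updates among its constructions without defining them here. The snippet below assumes the simplest reading, namely scaling each gradient row to unit Euclidean norm, and checks numerically that this map commutes with row permutations and with right-multiplication by a rotation. A coordinate-wise sign map, used as a stand-in for Adam-like elementwise updates, commutes with permutations but not with rotations. All function names are hypothetical, and the actual updates in the paper may differ.

```python
import math

def row_norm_update(G, eps=1e-12):
    """Scale each row to unit Euclidean norm (assumed form of a row-norm update)."""
    out = []
    for row in G:
        n = math.sqrt(sum(x * x for x in row))
        out.append([x / (n + eps) for x in row])
    return out

def sign_update(G):
    """Coordinate-wise sign: a stand-in for elementwise (Adam-like) updates."""
    return [[(x > 0) - (x < 0) for x in row] for row in G]

def permute_rows(G, perm):
    """Reorder rows; output row i is input row perm[i]."""
    return [G[i] for i in perm]

def rotate_columns(G, theta):
    """Right-multiply a 2-column matrix by the rotation [[c, s], [-s, c]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in G]

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

def equivariance_gap(update, transform, G):
    """Largest entry of update(transform(G)) - transform(update(G))."""
    return max_abs_diff(update(transform(G)), transform(update(G)))

# Toy 3x2 "gradient"; the values are arbitrary.
G = [[3.0, 4.0], [1.0, -2.0], [-0.5, 0.25]]
perm = lambda M: permute_rows(M, [2, 0, 1])
rot = lambda M: rotate_columns(M, 0.7)

row_perm_gap = equivariance_gap(row_norm_update, perm, G)
row_rot_gap = equivariance_gap(row_norm_update, rot, G)
sign_perm_gap = equivariance_gap(sign_update, perm, G)
sign_rot_gap = equivariance_gap(sign_update, rot, G)

print("row-norm:", row_perm_gap, row_rot_gap)
print("sign:    ", sign_perm_gap, sign_rot_gap)
```

The two row-norm gaps come out at floating-point round-off level, while the sign map shows a clearly nonzero gap under rotation. This is the kind of mismatch between coordinate-wise updates and the symmetry of the weight block that the abstract describes.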
