Fugu-MT Paper Translation (Abstract): NorMuon: Making Muon more efficient and scalable

Paper summary: NorMuon: Making Muon more efficient and scalable

arxiv url: http://arxiv.org/abs/2510.05491v1
Date: Tue, 07 Oct 2025 01:13:41 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.052014
Title: NorMuon: Making Muon more efficient and scalable
Title (reference translation): NorMuon: Making Muon more efficient and scalable
Authors: Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
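Abstract summary: We propose NorMuon, an optimizer that combines Muon's orthogonalization with neuron-level adaptive learning rates. NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon.
Reference score (independently computed attention score): 71.49702449498085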
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while maintaining a memory footprint comparable to Muon's. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
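Abstract (reference translation): The choice of optimizer has a large effect on the training efficiency and computational cost of large language models (LLMs). Recently, the Muon optimizer has shown promising results by orthogonalizing parameter updates, which improves the optimization geometry through better conditioning. Although Muon has emerged as a candidate successor to Adam, the possibility of jointly exploiting the strengths of both has not been systematically examined. This work fills that gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that combines orthogonalization with neuron-level adaptive learning rates. The analysis shows that while Muon effectively reduces condition numbers, the resulting updates have highly non-uniform neuron norms, so that certain neurons dominate the optimization process. To address this imbalance, NorMuon maintains second-order momentum statistics for each neuron and applies row-wise normalization after orthogonalization, balancing parameter utilization while preserving Muon's conditioning benefits. For practical use at scale, the authors develop an efficient distributed implementation under the FSDP2 framework that strategically distributes the orthogonalization computation across devices. In experiments across multiple model scales, NorMuon consistently outperforms both Adam and Muon, improving on Adam by 21.74% and on Muon by 11.31%, while keeping a memory footprint comparable to Muon's. The results suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.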
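
The update described in the abstract (momentum, orthogonalization, per-neuron second-moment statistics, row-wise normalization) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract gives no formulas, so the function names, the hyperparameter values (`lr`, `beta1`, `beta2`, `eps`, `steps`), the plain cubic Newton-Schulz iteration used for orthogonalization, and the final Frobenius-norm rescaling are all assumptions made here for illustration.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=10, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to m (cubic Newton-Schulz).

    Assumption: a plain cubic iteration stands in for whatever the paper uses.
    """
    x = m / (np.linalg.norm(m) + eps)  # Frobenius scaling keeps singular values <= 1
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T  # iterate on the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes each singular value toward 1
    return x.T if tall else x


def init_state(shape):
    """Hypothetical optimizer state: full-size momentum plus one scalar per row."""
    return {"m": np.zeros(shape), "v": np.zeros(shape[0])}


def normuon_like_step(w, grad, state, lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One illustrative update; returns the new weights and updates state in place."""
    # First-order momentum of the gradient, as in Muon.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    # Orthogonalize the momentum to improve the conditioning of the update.
    o = newton_schulz_orthogonalize(state["m"])
    # Per-neuron (per-row) second-moment statistic of the orthogonalized update.
    row_mean_sq = np.mean(o * o, axis=1)
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * row_mean_sq
    # Row-wise normalization: each neuron's update is divided by its own RMS estimate.
    o_hat = o / (np.sqrt(state["v"])[:, None] + eps)
    # Assumption: rescale so the overall update keeps the Frobenius norm of o.
    o_hat *= np.linalg.norm(o) / (np.linalg.norm(o_hat) + eps)
    return w - lr * o_hat
```

In this sketch the extra state `v` holds one scalar per row rather than one per parameter, which is in line with the abstract's claim of a memory footprint comparable to Muon's. For a tall weight matrix the rows of the orthogonalized update generally have unequal norms, and the row-wise division is what evens them out.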

List of related papers