TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
- URL: http://arxiv.org/abs/2602.13498v1
- Date: Fri, 13 Feb 2026 22:11:59 GMT
- Title: TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers
- Authors: Peng Cheng, Jiucheng Zang, Qingnan Li, Liheng Ma, Yufei Cui, Yingxue Zhang, Boxing Chen, Ming Jian, Wen Tong
- Abstract summary: TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines.
- Score: 24.534939825452884
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.
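The abstract names the mechanism (Newton-Schulz orthogonalization plus two magnitude stabilizers) but gives no formulas, so the following is only a minimal sketch of how the pieces might compose. `newton_schulz` uses the quintic coefficients from the public Muon reference implementation; `trasmuon_like_step`, `rms_ref`, `trust_ratio`, and the EMA energy baseline are illustrative assumptions, not the authors' definitions.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration,
    with the coefficients used by the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # bring the spectrum into the convergent range
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def trasmuon_like_step(momentum, energy_ema, rms_ref=1.0, trust_ratio=2.0,
                       ema_beta=0.99, eps=1e-8):
    """Assumed shape of the two stabilizers:
    (i)  global RMS calibration: rescale the unit-RMS orthogonal direction by
         the momentum's own RMS against a global reference, reintroducing
         magnitude information;
    (ii) energy-based trust region: if the update energy exceeds trust_ratio
         times its running average, project back onto the region boundary."""
    O = newton_schulz(momentum)
    O = O / (O.pow(2).mean().sqrt() + eps)               # unit-RMS direction
    U = O * momentum.pow(2).mean().sqrt() / rms_ref      # (i) RMS calibration
    energy = U.pow(2).sum()
    if energy_ema > 0 and energy > trust_ratio * energy_ema:
        U = U * (trust_ratio * energy_ema / energy).sqrt()  # (ii) energy clip
        energy = trust_ratio * energy_ema
    energy_ema = ema_beta * energy_ema + (1 - ema_beta) * energy
    return U, energy_ema
```

In a full optimizer this would run per 2-D weight matrix, with `momentum` maintained exactly as in Muon and the returned update multiplied by the learning rate.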
Related papers
- Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum [5.049533819651459]
We propose a new optimizer, NAMO, and a diagonal extension, NAMO-D, to provide the first principled integration of momentum with noise adaptation. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to AdamW and Muon.
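The blurb pins down one concrete operation, so a sketch is possible: right-multiplying the orthogonalized momentum by a clamped diagonal matrix. Here the per-column statistic `v_col`, the mean-normalization, and the clamp range are assumptions, not the paper's choices.

```python
import torch

def namo_d_like_update(O, v_col, d_min=0.25, d_max=4.0, eps=1e-8):
    """O: orthogonalized momentum, shape (rows, cols).
    v_col: an assumed per-column second-moment estimate, shape (cols,)."""
    d = 1.0 / (v_col.sqrt() + eps)   # Adam-style inverse-RMS scale per column
    d = d / d.mean()                 # keep the average scale near 1 (assumption)
    d = d.clamp(d_min, d_max)        # the clamped diagonal entries
    return O * d                     # broadcasting over columns == O @ diag(d)
```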
arXiv Detail & Related papers (2026-02-19T05:00:39Z) - Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning [10.647088281181222]
SpecMuon is a spectral-aware, multi-mode gradient flow for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
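Only the high-level rule is stated here (step sizes regulated by the global loss energy), so the snippet below is one plausible, stability-oriented reading; every detail is an assumption rather than SpecMuon's formula.

```python
def energy_regulated_lr(base_lr, loss, loss_ema, beta=0.99, floor=0.1):
    """Assumed rule: damp the step size when the current loss (a float)
    spikes above its running average, never below `floor` of base_lr."""
    loss_ema = beta * loss_ema + (1 - beta) * loss
    damp = max(floor, min(1.0, loss_ema / (loss + 1e-12)))
    return base_lr * damp, loss_ema
```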
arXiv Detail & Related papers (2026-02-18T03:56:20Z) - Delving into Muon and Beyond: Deep Analysis and Extensions [8.297062899157664]
We study Muon as the p = 0 endpoint of a family of spectral transformations of the form $U \Sigma^p V^\top$. We find that RMS-normalized updates yield more stable optimization than first-moment updates. Our results suggest that Muon is best understood as an effective form of spectral normalization, but not a universally superior optimization method.
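Reading the formula as $U \Sigma^p V^\top$, the family is easy to verify numerically: p = 1 leaves the matrix unchanged, while p = 0 sets every singular value to 1, i.e. the exact orthogonalization that Muon approximates with Newton-Schulz iterations.

```python
import torch

def spectral_power_transform(M, p):
    """Map M = U diag(s) V^T to U diag(s**p) V^T."""
    U, s, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ torch.diag(s.pow(p)) @ Vh

M = torch.randn(4, 6)
O = spectral_power_transform(M, p=0.0)                   # the Muon endpoint
print(torch.allclose(O @ O.T, torch.eye(4), atol=1e-5))  # True: orthonormal rows
```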
arXiv Detail & Related papers (2026-02-04T15:40:47Z) - Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. ARM is a strategy designed to approximate gradient descent dynamics. ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z) - Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion [0.0]
Current paradigms impose a static compromise on information channel drift parameters. We introduce a "low-dimensional" exploration model and a "low-dimensional" dynamic alignment framework.
arXiv Detail & Related papers (2025-12-01T17:04:17Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B pretraining setting.
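"Neuron-wise normalization" suggests equalizing step sizes per output neuron, i.e. per row of the orthogonalized update; below is a sketch under that reading, where the EMA statistic and its placement are assumptions.

```python
import torch

def neuronwise_normalize(O, row_v, beta2=0.95, eps=1e-8):
    """O: orthogonalized update, shape (out_features, in_features).
    row_v: running per-neuron mean-square estimate, shape (out_features, 1)."""
    row_v = beta2 * row_v + (1 - beta2) * O.pow(2).mean(dim=1, keepdim=True)
    return O / (row_v.sqrt() + eps), row_v
```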
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth setting and the more general $(L_0, L_1)$-smooth setting, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
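EF21 itself is a published error-feedback scheme, so its core recursion can be sketched with some confidence; how EF21-Muon couples it to the non-Euclidean LMO step is not specified in this blurb, so the sketch stops at producing the gradient estimate.

```python
import torch

def topk_compress(x, k):
    """Top-k sparsification: a standard contractive compressor for EF21."""
    flat = x.flatten()
    out = torch.zeros_like(flat)
    idx = flat.abs().topk(k).indices
    out[idx] = flat[idx]
    return out.view_as(x)

def ef21_update(grad, g_state, k):
    """One EF21 step (single worker for brevity): only the compressed
    innovation is communicated, and both sides add it to the shared
    estimate g_state, which then drives the (e.g. Muon-style) update."""
    return g_state + topk_compress(grad - g_state, k)
```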
arXiv Detail & Related papers (2025-10-01T08:20:08Z) - AdaMuon: Adaptive Muon Optimizer [11.281916426508216]
AdaMuon combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon maintains stability while surpassing Adam by more than 40% in training efficiency in large-scale scenarios.
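"Element-wise adaptivity with orthogonal updates" reads like an entrywise second-moment rescale applied to the orthogonalized momentum; the sketch below assumes that, plus a norm-restoring step (also a guess, since a raw entrywise division would undo Muon's scale balance).

```python
import torch

def elementwise_adaptive_update(O, v, beta2=0.999, eps=1e-8):
    """O: orthogonalized momentum; v: entrywise second-moment EMA, same shape."""
    v = beta2 * v + (1 - beta2) * O.pow(2)
    U = O / (v.sqrt() + eps)                  # Adam-style entrywise rescale
    U = U * O.norm() / (U.norm() + eps)       # restore the norm (assumption)
    return U, v
```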
arXiv Detail & Related papers (2025-07-15T05:49:37Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation that is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization. It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
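The operative idea, per the title, is constraining quantities to a hypersphere; a minimal sketch of the core operation follows (which tensors are normalized, and along which axis, are assumptions).

```python
import torch

def to_hypersphere(x, dim=-1, eps=1e-8):
    """L2-normalize vectors onto the unit hypersphere along `dim`."""
    return x / (x.norm(dim=dim, keepdim=True) + eps)
```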
arXiv Detail & Related papers (2025-02-21T08:17:24Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator, incorporating it into the loss and sampling procedure through a method we call Orbit Diffusion.
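The variance reduction follows the classic Rao-Blackwell pattern: average the estimator over a symmetry group's orbit before differentiating. A generic sketch, where the group, the loss, and the finite orbit sample are placeholders rather than the paper's construction:

```python
import math
import torch

def orbit_averaged_loss(loss_fn, x, group_actions):
    """Average a scalar loss over group actions applied to x; conditioning
    on the orbit average cannot increase the gradient estimator's variance."""
    return torch.stack([loss_fn(g(x)) for g in group_actions]).mean()

# Toy usage: average over 8 planar rotations of a point cloud x of shape (n, 2).
def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, -s], [s, c]])
    return lambda pts: pts @ R.T

actions = [rotation(2 * math.pi * k / 8) for k in range(8)]
```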
arXiv Detail & Related papers (2025-02-14T03:26:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.