Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
- URL: http://arxiv.org/abs/2602.16167v1
- Date: Wed, 18 Feb 2026 03:56:20 GMT
- Title: Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning
- Authors: Binghang Lu, Jiahao Zhang, Guang Lin
- Abstract summary: SpecMuon is a spectral-aware optimizer, formulated as a multi-mode gradient flow, for physics-informed learning. It regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. It achieves faster convergence and improved stability compared with Adam and AdamW.
- Score: 10.647088281181222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Physics-informed neural networks and neural operators often suffer from severe optimization difficulties caused by ill-conditioned gradients, multi-scale spectral behavior, and stiffness induced by physical constraints. Recently, the Muon optimizer has shown promise by performing orthogonalized updates in the singular-vector basis of the gradient, thereby improving geometric conditioning. However, its unit-singular-value updates may lead to overly aggressive steps and lack explicit stability guarantees when applied to physics-informed learning. In this work, we propose SpecMuon, a spectral-aware optimizer that integrates Muon's orthogonalized geometry with a mode-wise relaxed scalar auxiliary variable (RSAV) mechanism. By decomposing matrix-valued gradients into singular modes and applying RSAV updates individually along dominant spectral directions, SpecMuon adaptively regulates step sizes according to the global loss energy while preserving Muon's scale-balancing properties. This formulation interprets optimization as a multi-mode gradient flow and enables principled control of stiff spectral components. We establish rigorous theoretical properties of SpecMuon, including a modified energy dissipation law, positivity and boundedness of auxiliary variables, and global convergence with a linear rate under the Polyak-Lojasiewicz condition. Numerical experiments on physics-informed neural networks, DeepONets, and fractional PINN-DeepONets demonstrate that SpecMuon achieves faster convergence and improved stability compared with Adam, AdamW, and the original Muon optimizer on benchmark problems such as the one-dimensional Burgers equation and fractional partial differential equations.
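As a rough illustration of the mechanism described in the abstract, the sketch below combines a Muon-style orthogonalized step (singular values replaced by ones) with a per-mode damping factor standing in for the mode-wise RSAV variable. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `specmuon_like_update`, the number of regulated modes `k`, and the fixed ratio `r` are illustrative placeholders.

```python
import numpy as np

def specmuon_like_update(W, G, lr=1e-3, k=8, r=0.9):
    """Illustrative sketch only (not the paper's exact algorithm).

    Muon-style step: orthogonalize the matrix gradient G by replacing its
    singular values with ones, then damp the k dominant spectral modes by a
    scalar factor r standing in for the mode-wise RSAV variable that SpecMuon
    derives from the global loss energy.
    """
    # Singular-value decomposition of the matrix-valued gradient.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)

    # Muon's orthogonalized direction uses unit singular values (U @ Vt).
    scale = np.ones_like(s)

    # Mode-wise regulation of the dominant directions; in SpecMuon this factor
    # would be the relaxed scalar auxiliary variable, here a fixed placeholder.
    scale[: min(k, scale.size)] = r

    # Gradient-descent-style parameter update along the rescaled modes.
    return W - lr * (U * scale) @ Vt

# Toy usage on random matrices.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
G = rng.standard_normal((64, 32))
W_new = specmuon_like_update(W, G)
```

In practice, Muon typically approximates the orthogonalization with Newton-Schulz iterations rather than an explicit SVD, and SpecMuon's per-mode factors would evolve with the modified energy law rather than stay fixed as they do in this sketch.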
Related papers
- Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization [40.95701844244596]
We show that zeroth-order (ZO) optimization can be substantially improved by unifying two complementary principles. We instantiate these principles in a new method, ZO-Muon, which admits a natural interpretation as a low-rank Muon in the ZO setting.
arXiv Detail & Related papers (2026-02-19T08:08:33Z) - Variational Entropic Optimal Transport [67.76725267984578]
We propose Variational Entropic Optimal Transport (VarEOT) for domain translation problems. VarEOT is based on an exact variational reformulation of the log-partition term $\log \mathbb{E}[\exp(\cdot)]$ as a tractable optimization over an auxiliary positive normalizer. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality.
arXiv Detail & Related papers (2026-02-02T15:48:44Z) - Physics-Informed Chebyshev Polynomial Neural Operator for Parametric Partial Differential Equations [17.758049557300826]
We introduce the Physics-Informed Chebyshev Polynomial Neural Operator (CPNO). CPNO replaces unstable monomial expansions with a numerically stable Chebyshev spectral basis. Experiments on benchmark parameterized PDEs show that CPNO achieves superior accuracy, faster convergence, and enhanced robustness to hyperparameters.
arXiv Detail & Related papers (2026-02-02T07:19:56Z) - FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer [30.184978506988767]
We introduce FISMO, which incorporates anisotropic geometric information through Fisher information geometry. FISMO achieves superior efficiency and final performance compared to established baselines.
arXiv Detail & Related papers (2026-01-29T14:05:04Z) - Majorization-Minimization Networks for Inverse Problems: An Application to EEG Imaging [4.063392865490957]
Inverse problems are often ill-posed and require optimization schemes with strong stability and convergence guarantees. We propose a learned Majorization-Minimization (MM) framework for inverse problems within a bilevel optimization setting. We learn a structured curvature majorant that governs each MM step while preserving classical MM descent guarantees.
arXiv Detail & Related papers (2026-01-23T10:33:45Z) - Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Neural Optimal Transport Meets Multivariate Conformal Prediction [58.43397908730771]
We propose a framework for conditional vector quantile regression (CVQR). CVQR combines neural optimal transport with quantile optimization and applies it to conformal prediction.
arXiv Detail & Related papers (2025-09-29T19:50:19Z) - Reparameterized LLM Training via Orthogonal Equivalence Transformation [54.80172809738605]
We present POET, a novel training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. POET can stably optimize the objective function with improved generalization. We develop efficient approximations that make POET flexible and scalable for training large-scale neural networks.
arXiv Detail & Related papers (2025-06-09T17:59:34Z) - AYLA: Amplifying Gradient Sensitivity via Loss Transformation in Non-Convex Optimization [0.0]
Stochastic Gradient Descent (SGD) and its variants, such as Adam, are foundational to deep learning optimization. This paper introduces AYLA, a novel framework that enhances training dynamics.
arXiv Detail & Related papers (2025-04-02T16:31:39Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust, and accelerated stochastic iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Neural Control Variates [71.42768823631918]
We show that a set of neural networks can address the challenge of finding a good approximation of the integrand.
We derive a theoretically optimal, variance-minimizing loss function, and propose an alternative, composite loss for stable online training in practice.
Specifically, we show that the learned light-field approximation is of sufficient quality for high-order bounces, allowing us to omit the error correction and thereby dramatically reduce the noise at the cost of negligible visible bias.
arXiv Detail & Related papers (2020-06-02T11:17:55Z)