Muon Optimizes Under Spectral Norm Constraints
- URL: http://arxiv.org/abs/2506.15054v2
- Date: Mon, 29 Sep 2025 07:34:14 GMT
- Title: Muon Optimizes Under Spectral Norm Constraints
- Authors: Lizhang Chen, Jonathan Li, Qiang Liu
- Abstract summary: We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
- Score: 12.29696026957078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer [JJB+24] has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-$\mathcal{K}$ family of optimizers [CLLL24]. Specifically, we show that Muon corresponds to Lion-$\mathcal{K}$ when equipped with the nuclear norm, and we leverage the theoretical results of Lion-$\mathcal{K}$ to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map $\mathcal{K}$, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
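The spectral norm constraint described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a simplified Muon step of the form W ← W − lr · (msign(M) + weight_decay · W), computes the matrix sign with an exact SVD rather than the Newton-Schulz iteration commonly used in practice, and uses a hypothetical quadratic loss whose minimizer has a large spectral norm. The names `msign`, `muon_step`, and all hyperparameter values are illustrative assumptions. Because msign(M) has spectral norm at most 1, each step with lr · weight_decay ≤ 1 keeps the iterate inside the ball of spectral norm 1/weight_decay once it is there.

```python
import numpy as np

def msign(m):
    """Matrix sign U @ V.T taken from the reduced SVD m = U diag(s) V.T."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(w, m, grad, lr=0.05, beta=0.9, weight_decay=0.5):
    """One simplified Muon step with decoupled weight decay (an assumed form)."""
    m = beta * m + (1.0 - beta) * grad          # momentum accumulation
    w = w - lr * (msign(m) + weight_decay * w)  # orthogonalized step plus decay
    return w, m

rng = np.random.default_rng(0)
# Hypothetical target far outside the spectral-norm ball of radius 1 / weight_decay.
target = 10.0 * rng.standard_normal((6, 4))

weight_decay = 0.5
w = np.zeros((6, 4))
m = np.zeros_like(w)
for _ in range(500):
    grad = w - target  # gradient of 0.5 * ||w - target||_F^2
    w, m = muon_step(w, m, grad, weight_decay=weight_decay)

spectral_norm = np.linalg.norm(w, ord=2)  # largest singular value of w
bound = 1.0 / weight_decay
print(f"spectral norm of W: {spectral_norm:.4f}, bound 1/weight_decay: {bound:.4f}")
```

Although the unconstrained minimizer is `target`, the iterate stays within the spectral-norm ball of radius `1 / weight_decay`, which is the kind of implicit constraint the abstract attributes to Muon with decoupled weight decay.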
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Towards a Principled Muon under $μ\mathsf{P}$: Ensuring Spectral Conditions throughout Training [0.0]
We show how to reliably guarantee the spectral conditions required by $\mu$P for large language model (LLM) training. We develop a variant of Muon, namely Muon++, that satisfies the spectral condition throughout the training process.
arXiv Detail & Related papers (2026-01-04T00:04:05Z) - Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norm balls outperform Adam-type methods in the training of large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization [37.169656352055604]
We introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
arXiv Detail & Related papers (2025-12-10T14:25:45Z) - Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth setting and the more general $(L_0, L_1)$-smooth setting, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
arXiv Detail & Related papers (2025-10-01T08:20:08Z) - Muon: Training and Trade-offs with Latent Attention and MoE [4.500362688166346]
We present a comprehensive theoretical and empirical study of the Muon optimizer for training small to medium decoder-only transformers (30M - 200M parameters). We provide rigorous theoretical analysis including: (i) the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) the connection to natural gradient descent on the Stiefel manifold, and (iv) the equivalence to steepest gradient descent under the spectral norm.
arXiv Detail & Related papers (2025-09-29T07:51:06Z) - On the Convergence of Muon and Beyond [31.900178928104648]
We provide the first proof that variance reduction enables Muon-MVR2 to attain the optimal complexity. Overall, this work offers the first proof of optimality for a Muon-style optimizer.
arXiv Detail & Related papers (2025-09-19T09:43:37Z) - Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z) - Convergence Bound and Critical Batch Size of Muon Optimizer [1.2289361708127877]
We provide convergence proofs for Muon across four practical settings. We show that the addition of weight decay yields strictly tighter theoretical bounds. We derive the critical batch size for Muon that minimizes the computational cost of training.
arXiv Detail & Related papers (2025-07-02T11:03:13Z) - Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [38.99428012275441]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size. We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z) - On the Convergence Analysis of Muon [19.29806555936508]
We present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices.
arXiv Detail & Related papers (2025-05-29T17:58:01Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization [19.574602844234814]
We provide a theoretical analysis of matrix gradient orthogonalization. In particular, we show that the non-Euclidean trust-region method can be seen as a special case. Our findings provide an explanation for several practical observations.
arXiv Detail & Related papers (2025-03-16T20:49:34Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving efficiency of RL fine-tuning for large language models. Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. We propose an optimistic-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges to a Karush-Kuhn-Tucker (KKT) point at a rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by the gradient, where $d$ is the dimension and $K$ is the number of iterations.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Piecewise Linearity of Min-Norm Solution Map of a Nonconvexly Regularized Convex Sparse Model [8.586951231230596]
We study the piecewise constant sparsity pattern of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
We iteratively compute the closed-form expression of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
arXiv Detail & Related papers (2023-11-30T10:39:47Z) - The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Log-based Sparse Nonnegative Matrix Factorization for Data Representation [55.72494900138061]
Nonnegative matrix factorization (NMF) has been widely studied in recent years due to its effectiveness in representing nonnegative data with parts-based representations.
We propose a new NMF method with log-norm imposed on the factor matrices to enhance the sparseness.
A novel column-wise sparse norm, named the $\ell_{2,\log}$-(pseudo) norm, is proposed to enhance the robustness of the proposed method.
arXiv Detail & Related papers (2022-04-22T11:38:10Z) - Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint that guarantees low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.