Muon Optimizes Under Spectral Norm Constraints
- URL: http://arxiv.org/abs/2506.15054v1
- Date: Wed, 18 Jun 2025 01:32:39 GMT
- Title: Muon Optimizes Under Spectral Norm Constraints
- Authors: Lizhang Chen, Jonathan Li, Qiang Liu
- Abstract summary: We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
- Score: 12.57291626702513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer [JJB+24] has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-$\mathcal{K}$ family of optimizers [CLLL24]. Specifically, we show that Muon corresponds to Lion-$\mathcal{K}$ when equipped with the nuclear norm, and we leverage the theoretical results of Lion-$\mathcal{K}$ to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map $\mathcal{K}$, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
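To make the spectral-norm reading concrete, below is a minimal Python sketch of a Muon-style update with decoupled weight decay. It is an illustration of the mechanism described in the abstract, not the authors' reference implementation: the orthogonalization uses an exact SVD for clarity (practical Muon implementations approximate it with Newton-Schulz iterations), and the hyperparameter values are placeholders.

```python
# Minimal sketch of a Muon-style update (illustrative, not the reference code).
# msign(M) = U V^T replaces all singular values of the momentum buffer by 1,
# so every update direction has spectral norm exactly 1; decoupled weight
# decay then keeps the iterates inside a spectral-norm ball of radius ~1/wd.
import numpy as np

def msign(M: np.ndarray) -> np.ndarray:
    """Orthogonalize M via SVD (Newton-Schulz is used in practice)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_step(W, M, grad, lr=0.02, beta=0.95, wd=0.01):
    """One Muon-style step with momentum and decoupled weight decay."""
    M = beta * M + grad              # momentum buffer
    O = msign(M)                     # orthogonalized update, ||O||_2 = 1
    W = W - lr * (O + wd * W)        # decoupled weight decay
    return W, M

# Toy loop with a placeholder gradient: the spectral norm of W stays bounded.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
M = np.zeros_like(W)
for _ in range(100):
    grad = rng.standard_normal(W.shape)
    W, M = muon_step(W, M, grad)
print("spectral norm of W:", np.linalg.norm(W, 2))
```

Since $\|UV^\top\|_2 = 1$, one step gives $\|W_{t+1}\|_2 \le (1-\eta\lambda)\|W_t\|_2 + \eta$ (with $\eta$ the learning rate and $\lambda$ the weight decay), so any iterate with $\|W_t\|_2 > 1/\lambda$ is contracted; the weight decay coefficient therefore sets the radius of the spectral-norm ball that the iterates settle into, matching the constrained-optimization behavior described in the abstract.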
Related papers
- Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z)
- Convergence Bound and Critical Batch Size of Muon Optimizer [1.2289361708127877]
We provide convergence proofs for Muon across four practical settings. We show that the addition of weight decay yields strictly tighter theoretical bounds. We derive the critical batch size for Muon that minimizes the computational cost of training.
arXiv Detail & Related papers (2025-07-02T11:03:13Z)
- Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order [38.99428012275441]
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Traditional first-order algorithms incur prohibitive memory and computational costs that scale poorly with model size. We propose zero-order (ZO) optimization methods as a memory- and compute-efficient alternative.
arXiv Detail & Related papers (2025-06-04T20:27:17Z)
- On the Convergence Analysis of Muon [19.29806555936508]
We present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices.
arXiv Detail & Related papers (2025-05-29T17:58:01Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization [19.574602844234814]
We provide a theoretical analysis of gradient orthogonalization for deep learning. In particular, we show that it can be seen as a special case of the non-Euclidean trust-region method (see the duality sketch after this list). Our findings provide an explanation for several practical observations.
arXiv Detail & Related papers (2025-03-16T20:49:34Z)
- Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models. Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. We propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z)
- Convergence Rate Analysis of LION [54.28350823319057]
LION converges at a rate of $\mathcal{O}(\sqrt{d}\,K^{-1/4})$, measured by the gradient norm at Karush-Kuhn-Tucker (KKT) points.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z)
- Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
- Piecewise Linearity of Min-Norm Solution Map of a Nonconvexly Regularized Convex Sparse Model [8.586951231230596]
We study the piecewise constant sparsity pattern of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
We iteratively compute the closed-form expression of $\mathbf{x}_\star(\mathbf{y},\lambda)$ in each linear zone.
arXiv Detail & Related papers (2023-11-30T10:39:47Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-the-Hessian solutions in deep linear networks.
We show that for all depths greater than one, with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Log-based Sparse Nonnegative Matrix Factorization for Data Representation [55.72494900138061]
Nonnegative matrix factorization (NMF) has been widely studied in recent years due to its effectiveness in representing nonnegative data with parts-based representations.
We propose a new NMF method with log-norm imposed on the factor matrices to enhance the sparseness.
A novel column-wise sparse norm, named the $\ell_{2,\log}$-(pseudo) norm, is proposed to enhance the robustness of the proposed method.
arXiv Detail & Related papers (2022-04-22T11:38:10Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under a constraint guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
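As a side note on the trust-region reading in the "Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization" entry above, the link between orthogonalized updates and the spectral norm follows from the standard duality between the spectral norm $\|\cdot\|_2$ and the nuclear norm $\|\cdot\|_*$ (a textbook fact, not a result quoted from that paper): for a gradient $G$ with SVD $G = U\Sigma V^\top$,
$$
\min_{\|\Delta\|_2 \le r} \langle G, \Delta\rangle \;=\; -\,r\,\|G\|_*, \qquad \text{attained at } \Delta^\star = -\,r\,UV^\top .
$$
The best linearized step inside a spectral-norm trust region is thus exactly an orthogonalized gradient, consistent with the main abstract's identification of Muon as Lion-$\mathcal{K}$ with $\mathcal{K}$ the nuclear norm.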
This list is automatically generated from the titles and abstracts of the papers on this site.