On the Convergence Analysis of Muon
- URL: http://arxiv.org/abs/2505.23737v1
- Date: Thu, 29 May 2025 17:58:01 GMT
- Title: On the Convergence Analysis of Muon
- Authors: Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang
- Abstract summary: We present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices.
- Score: 19.29806555936508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We further characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices -- phenomena widely observed in practical neural network training. Our experimental results support and corroborate the theoretical findings.
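For readers unfamiliar with the optimizer under analysis, below is a minimal numpy sketch of a Muon-style update on a single matrix parameter, assuming the commonly described recipe of a momentum buffer followed by approximate orthogonalization of the update via a Newton-Schulz iteration; the function names, coefficients, learning rate, and toy objective are illustrative choices, not the exact algorithm or hyperparameters analyzed in this paper.

```python
# Hypothetical sketch of a Muon-style update (momentum + Newton-Schulz
# orthogonalization); coefficients and hyperparameters are illustrative.
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G = U S V^T to its orthogonal factor U V^T using a
    quintic Newton-Schulz iteration on the Frobenius-normalized matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values lie in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style step on matrix parameter W; returns (new W, new momentum buffer)."""
    buf = beta * buf + grad                        # heavy-ball momentum buffer
    update = newton_schulz_orthogonalize(buf)      # orthogonalized direction, spectral norm ~1
    return W - lr * update, buf

# Toy usage: drive W toward a random target T under the loss 0.5 * ||W - T||_F^2,
# whose gradient is simply W - T. The loss should decrease over the iterations.
rng = np.random.default_rng(0)
W, T = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))
buf = np.zeros_like(W)
for _ in range(200):
    W, buf = muon_step(W, W - T, buf)
print("final loss:", 0.5 * np.linalg.norm(W - T) ** 2)
```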
Related papers
- Fourier Neural Operators for Non-Markovian Processes: Approximation Theorems and Experiments [2.84475965465923]
This paper introduces an operator-based neural network, the mirror-padded neural operator (MFNO). MFNO extends the standard Fourier neural operator (FNO) by incorporating mirror padding, enabling it to handle non-periodic inputs. We rigorously prove that MFNOs can approximate solutions of path-dependent differential equations and transformations of fractional Brownian motions to an arbitrary degree of accuracy.
arXiv Detail & Related papers (2025-07-23T19:30:34Z) - Convergence Bound and Critical Batch Size of Muon Optimizer [1.2289361708127877]
We provide convergence proofs for Muon across four practical settings. We show that the addition of weight decay yields strictly tighter theoretical bounds. We derive the critical batch size for Muon that minimizes the computational cost of training.
arXiv Detail & Related papers (2025-07-02T11:03:13Z) - Muon Optimizes Under Spectral Norm Constraints [12.57291626702513]
We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices (a worked formulation of this constrained view appears after this list). This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
arXiv Detail & Related papers (2025-06-18T01:32:39Z) - PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective [6.497756166630786]
We introduce a unifying framework for analyzing "matrix-aware" preconditioned methods. Within this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients.
arXiv Detail & Related papers (2025-05-27T22:11:21Z) - Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective [3.48097307252416]
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. We show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels (the textbook damped update is recalled after this list).
arXiv Detail & Related papers (2024-12-18T16:51:47Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference.
Online VSMC is capable of performing, efficiently and entirely on-the-fly, both parameter estimation and particle proposal adaptation.
arXiv Detail & Related papers (2023-12-19T21:45:38Z) - Towards Demystifying the Generalization Behaviors When Neural Collapse Emerges [132.62934175555145]
Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT).
We propose a theoretical explanation for why continuing training can still lead to accuracy improvement on the test set, even after the training accuracy has reached 100%.
We refer to this newly discovered property as "non-conservative generalization".
arXiv Detail & Related papers (2023-10-12T14:29:02Z) - Stochastic normalizing flows as non-equilibrium transformations [62.997667081978825]
We show that normalizing flows provide a route to sample lattice field theories more efficiently than conventional Monte Carlo simulations.
We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
arXiv Detail & Related papers (2022-01-21T19:00:18Z) - Generalization Properties of Stochastic Optimizers via Trajectory Analysis [48.38493838310503]
We show that both the Fernique-Talagrand functional and the local power law are predictive of generalization performance.
arXiv Detail & Related papers (2021-08-02T10:58:32Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the role of stochasticity in its success is still unclear.
We show that heavy-tailed behavior commonly arises in the parameters from discrete multiplicative noise due to variance.
A detailed analysis is conducted of key factors, including step size and data, with similar results observed on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
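As a worked formulation of the spectral-norm viewpoint in the "Muon Optimizes Under Spectral Norm Constraints" entry above, which also makes the connection to PolarGrad's polar decomposition explicit (a standard reading of those summaries, not a restatement of either paper's exact theorem): for a gradient matrix $G$ with reduced SVD $G = U \Sigma V^\top$, the linearized steepest-descent subproblem under a spectral-norm trust region is

\[
\min_{\|\Delta\|_2 \le \eta} \langle G, \Delta \rangle
\quad\Longrightarrow\quad
\Delta^\star = -\eta\, U V^\top,
\qquad
\langle G, \Delta^\star \rangle = -\eta\, \|G\|_* ,
\]

where $\|\cdot\|_*$ is the nuclear norm (the dual of the spectral norm). The matrix $U V^\top$ is precisely the orthogonal polar factor of $G$, since $G = (U V^\top)(V \Sigma V^\top)$.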
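Similarly, the damped update referenced in the Gauss-Newton entry above is, in its textbook Levenberg-Marquardt form (the paper's specific dynamics may differ): with Jacobian $J_t$ of the network outputs and residual vector $r_t$ at parameters $\theta_t$,

\[
\theta_{t+1} = \theta_t - \left( J_t^\top J_t + \lambda I \right)^{-1} J_t^\top r_t ,
\]

where the damping factor $\lambda > 0$ keeps the linear system well conditioned even when the kernel $J_t J_t^\top$ is ill-conditioned, and $\lambda \to 0$ recovers the plain Gauss-Newton step.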
This list is automatically generated from the titles and abstracts of the papers on this site.