Convergence Bound and Critical Batch Size of Muon Optimizer
- URL: http://arxiv.org/abs/2507.01598v2
- Date: Mon, 04 Aug 2025 04:29:15 GMT
- Title: Convergence Bound and Critical Batch Size of Muon Optimizer
- Authors: Naoki Sato, Hiroki Naganuma, Hideaki Iiduka
- Abstract summary: We provide convergence proofs for Muon across four practical settings. We show that the addition of weight decay yields strictly tighter theoretical bounds. We derive the critical batch size for Muon that minimizes the computational cost of training.
- Score: 1.2289361708127877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings.
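To make the object of the analysis concrete, the following is a minimal sketch of a Muon-style update for a single 2-D weight matrix in the standard configuration the paper analyzes (Nesterov momentum plus decoupled weight decay). The Newton-Schulz orthogonalization, its quintic coefficients, the function names, and the hyperparameter values are illustrative assumptions drawn from common descriptions of Muon, not taken from this paper.

```python
# Minimal sketch of a Muon-style update for one 2-D weight matrix (assumptions:
# PyTorch, a quintic Newton-Schulz orthogonalization with commonly cited
# coefficients, and illustrative hyperparameters -- none taken from this paper).
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate U V^T from the SVD of M with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315          # assumed quintic coefficients
    X = M / (M.norm() + eps)                   # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                                # iterate in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95, weight_decay=0.01, nesterov=True):
    """One step: momentum buffer -> orthogonalized direction -> decoupled weight decay."""
    buf.mul_(beta).add_(grad)                  # heavy-ball momentum accumulation
    direction = grad + beta * buf if nesterov else buf
    O = newton_schulz_orthogonalize(direction)
    W.mul_(1.0 - lr * weight_decay)            # decoupled weight decay ("with weight decay" setting)
    W.add_(O, alpha=-lr)
    return W, buf

# Toy usage with a random matrix standing in for a layer's weight and gradient.
W = torch.randn(256, 128)
buf = torch.zeros_like(W)
for _ in range(3):
    g = torch.randn_like(W)                    # stand-in for a mini-batch gradient
    W, buf = muon_step(W, g, buf)
```

For context, the critical batch size referred to in the abstract is, informally, the batch size that minimizes the total number of stochastic gradient computations (number of steps times batch size) needed to reach a target accuracy; beyond it, larger batches no longer pay for the extra per-step cost.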
Related papers
- On the Convergence Analysis of Muon [19.29806555936508]
We present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices.
arXiv Detail & Related papers (2025-05-29T17:58:01Z) - On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts [66.39976432286905]
We study the convergence rates of the maximum likelihood estimator of gating and prompt parameters. We find that the estimability of these parameters is compromised when the prompt acquires overlapping knowledge with the pre-trained model.
arXiv Detail & Related papers (2025-05-24T01:30:46Z) - Advancing Constrained Monotonic Neural Networks: Achieving Universal Approximation Beyond Bounded Activations [4.659033572014701]
We show that networks with convex monotone activations and non-positive constrained weights qualify as universal approximators. We propose an alternative formulation that allows the network to adjust its activations according to the sign of the weights.
arXiv Detail & Related papers (2025-05-05T10:18:48Z) - Transition of $α$-mixing in Random Iterations with Applications in Queuing Theory [0.0]
We show the transfer of mixing properties from the exogenous regressor to the response via coupling arguments. We also study Markov chains in random environments with drift and minorization conditions, even under non-stationary environments.
arXiv Detail & Related papers (2024-10-07T14:13:37Z) - E$^2$M: Double Bounded $α$-Divergence Optimization for Tensor-based Discrete Density Estimation [3.9633191508712398]
We present a generalization of the expectation-maximization (EM) algorithm, called the E$^2$M algorithm. It circumvents the difficulty of optimizing the α-divergence directly by first relaxing the optimization into minimization of a surrogate objective based on the Kullback-Leibler (KL) divergence. Our approach offers flexible modeling for a variety of low-rank structures, including the CP, Tucker, and Tensor Train formats.
arXiv Detail & Related papers (2024-05-28T14:28:28Z) - Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks [78.11734286268455]
We study the performance of ConvResNeXts trained with weight decay, from the perspective of nonparametric classification. Our analysis allows for infinitely many building blocks in ConvResNeXts and shows that weight decay implicitly enforces sparsity on these blocks.
arXiv Detail & Related papers (2023-07-04T11:08:03Z) - Efficient Bound of Lipschitz Constant for Convolutional Layers by Gram Iteration [122.51142131506639]
We introduce a precise, fast, and differentiable upper bound for the spectral norm of convolutional layers using circulant matrix theory.
We show through a comprehensive set of experiments that our approach outperforms other state-of-the-art methods in terms of precision, computational cost, and scalability.
It proves highly effective for the Lipschitz regularization of convolutional neural networks, with competitive results against concurrent approaches; a plain power-iteration sketch of the spectral norm being bounded is given after this list.
arXiv Detail & Related papers (2023-05-25T15:32:21Z) - Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence [65.63201894457404]
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear stochastic differential equations. The key idea consists of fitting an RKHS-based approximation of the corresponding Fokker-Planck equation to observations of the system.
arXiv Detail & Related papers (2023-05-24T20:43:47Z) - A comprehensive theoretical framework for the optimization of neural networks classification performance with respect to weighted metrics [1.0499611180329804]
In many contexts, customized and weighted classification scores are designed in order to evaluate the goodness of predictions carried out by neural networks.
We provide a complete setting that formalizes weighted classification metrics and allows the construction of losses that drive the model to optimize these metrics of interest.
arXiv Detail & Related papers (2023-05-22T20:33:29Z) - Sampling with Mollified Interaction Energy Descent [57.00583139477843]
We present a new optimization-based method for sampling called mollified interaction energy descent (MIED).
MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs).
We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like SVGD.
arXiv Detail & Related papers (2022-10-24T16:54:18Z) - FIT: A Metric for Model Sensitivity [1.2622086660704197]
We propose FIT, which combines the Fisher information with a model of quantization.
We find that FIT can estimate the final performance of a network without retraining.
FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties.
arXiv Detail & Related papers (2022-10-16T10:25:29Z) - Beyond Smoothness: Incorporating Low-Rank Analysis into Nonparametric Density Estimation [20.38883021295225]
We introduce a new nonparametric latent variable model based on the Tucker decomposition.
A rudimentary implementation of our estimators experimentally demonstrates a considerable performance improvement over the standard histogram estimator.
arXiv Detail & Related papers (2022-04-02T19:45:07Z) - Controlling the Complexity and Lipschitz Constant improves polynomial nets [55.121200972539114]
We derive new complexity bounds for the set of Coupled CP-Decomposition (CCP) and Nested Coupled CP-decomposition (NCP) models of Polynomial Nets.
We propose a principled regularization scheme that we evaluate experimentally in six datasets and show that it improves the accuracy as well as the robustness of the models to adversarial perturbations.
arXiv Detail & Related papers (2022-02-10T14:54:29Z) - On Convergence of Training Loss Without Reaching Stationary Points [62.41370821014218]
We show that neural network weight variables do not converge to stationary points where the gradient of the loss function vanishes.
We propose a new perspective based on the ergodic theory of dynamical systems.
arXiv Detail & Related papers (2021-10-12T18:12:23Z) - Machine Learning and Variational Algorithms for Lattice Field Theory [1.198562319289569]
In lattice quantum field theory studies, parameters defining the lattice theory must be tuned toward criticality to access continuum physics.
We introduce an approach to "deform" Monte Carlo estimators based on contour deformations applied to the domain of the path integral.
We demonstrate that flow-based MCMC can mitigate critical slowing down and observifolds can exponentially reduce variance in proof-of-principle applications.
arXiv Detail & Related papers (2021-06-03T16:37:05Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the noise in its success is still unclear.
Modeling the optimizer as a discrete random recurrence, we show that multiplicative noise commonly arises in the parameters due to variance and results in heavy-tailed behaviour.
A detailed analysis describes how key factors, including the step size and the data, shape this behaviour, and state-of-the-art neural network models exhibit similar results.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z)
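As referenced in the Lipschitz-constant entry above, the snippet below estimates the spectral norm of a weight matrix (the Lipschitz constant of a linear layer) with plain power iteration. This is a standard baseline shown only to make the bounded quantity concrete; it is not the Gram iteration or circulant-matrix bound of that paper, and the function name and iteration count are assumptions.

```python
# Spectral norm (Lipschitz constant of a linear map) via standard power iteration.
# Illustrative baseline only; NOT the Gram-iteration bound described above.
import torch

def spectral_norm_power_iteration(W: torch.Tensor, iters: int = 50) -> float:
    """Estimate the largest singular value of a 2-D weight matrix W."""
    v = torch.randn(W.shape[1])
    v = v / v.norm()
    u = W @ v
    for _ in range(iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
    return float(u @ (W @ v))                  # Rayleigh-quotient estimate of sigma_max

W = torch.randn(64, 128)
print(spectral_norm_power_iteration(W))                 # power-iteration estimate
print(torch.linalg.matrix_norm(W, ord=2).item())        # exact value for comparison
```

For a convolutional layer, the same iteration applies to the layer's linear operator by replacing the matrix-vector products with convolution and transposed-convolution calls.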