The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization
- URL: http://arxiv.org/abs/2512.09678v1
- Date: Wed, 10 Dec 2025 14:25:45 GMT
- Title: The Ky Fan Norms and Beyond: Dual Norms and Combinations for Matrix Optimization
- Authors: Alexey Kravatskiy, Ivan Kozyrev, Nikolai Kozlov, Alexander Vinogradov, Daniil Merkulov, Ivan Oseledets
- Abstract summary: We introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this article, we explore the use of various matrix norms for optimizing functions of weight matrices, a crucial problem in training large language models. Moving beyond the spectral norm underlying the Muon update, we leverage duals of the Ky Fan $k$-norms to introduce a family of Muon-like algorithms we name Fanions, which are closely related to Dion. By working with duals of convex combinations of the Ky Fan $k$-norms with either the Frobenius norm or the $l_\infty$ norm, we construct the families of F-Fanions and S-Fanions, respectively. Their most prominent members are F-Muon and S-Muon. We complement our theoretical analysis with an extensive empirical study of these algorithms across a wide range of tasks and settings, demonstrating that F-Muon and S-Muon consistently match Muon's performance, while outperforming vanilla Muon on a synthetic linear least squares problem.
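To make the abstract's construction concrete: steepest descent under the dual of the Ky Fan $k$-norm amounts to keeping only the top-$k$ singular directions of the gradient (or momentum) matrix, which recovers Muon's full $UV^\top$ orthogonalization when $k = \min(m, n)$ and gives low-rank, Dion-like updates for small $k$. Below is a minimal numpy sketch of this linear maximization oracle; the names `ky_fan_norm` and `fanion_step` are ours for illustration and do not come from the paper's code.

```python
import numpy as np

def ky_fan_norm(M: np.ndarray, k: int) -> float:
    """Ky Fan k-norm: the sum of the k largest singular values."""
    return float(np.linalg.svd(M, compute_uv=False)[:k].sum())

def fanion_step(G: np.ndarray, k: int) -> np.ndarray:
    """Steepest-descent direction for the dual of the Ky Fan k-norm.

    For G = U diag(s) V^T, the maximizer of <G, D> over the unit ball of
    the dual norm is U_k V_k^T, and the attained value is the Ky Fan
    k-norm of G. With k = min(m, n) this is Muon's full U V^T.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] @ Vt[:k, :]

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))
D = fanion_step(G, k=2)
# Duality sanity check: <G, D> equals the Ky Fan 2-norm of G.
assert np.isclose(np.tensordot(G, D), ky_fan_norm(G, 2))
```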
Related papers
- NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training [50.27276603708547]
We show that despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. We propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure (an illustrative nuclear-norm projection sketch appears after this list).
arXiv Detail & Related papers (2026-03-04T00:10:14Z) - Muon in Associative Memory Learning: Training Dynamics and Scaling Laws [23.350512542598803]
We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs. We show that Muon mitigates the resulting imbalance across frequencies, leading to faster and more uniform progress.
arXiv Detail & Related papers (2026-02-05T14:49:40Z) - Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z) - Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods [50.070182958880146]
We propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting. We introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning.
arXiv Detail & Related papers (2025-10-12T19:39:41Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based method with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth setting and the more general $(L_0, L_1)$-smooth setting, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
arXiv Detail & Related papers (2025-10-01T08:20:08Z) - On the Convergence of Muon and Beyond [31.900178928104648]
We provide the first proof that variance reduction enables Muon-MVR2 to attain the optimal complexity. Overall, this work offers the first proof of optimality for a Muon-style optimizer.
arXiv Detail & Related papers (2025-09-19T09:43:37Z) - Muon Optimizes Under Spectral Norm Constraints [12.29696026957078]
We show that Muon implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices (a minimal sketch of the corresponding orthogonalized update appears after this list). This perspective allows for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
arXiv Detail & Related papers (2025-06-18T01:32:39Z) - On the Convergence Analysis of Muon [19.29806555936508]
We present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). Our theoretical results reveal that Muon can benefit from the low-rank and approximate blockwise diagonal structure of Hessian matrices.
arXiv Detail & Related papers (2025-05-29T17:58:01Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data [33.082961718280245]
We provide the first complete characterization of the implicit optimization bias of $p$-norm normalized steepest descent (NSD) and momentum steepest descent (NMD). Our results prove that these algorithms converge to solutions maximizing the margin with respect to the matrix $p$-norm, with established convergence rates.
arXiv Detail & Related papers (2025-02-07T05:09:32Z) - Log-based Sparse Nonnegative Matrix Factorization for Data Representation [55.72494900138061]
Nonnegative matrix factorization (NMF) has been widely studied in recent years due to its effectiveness in representing nonnegative data with parts-based representations.
We propose a new NMF method with log-norm imposed on the factor matrices to enhance the sparseness.
A novel column-wisely sparse norm, named the $\ell_{2,\log}$-(pseudo) norm, is proposed to enhance the robustness of the proposed method.
arXiv Detail & Related papers (2022-04-22T11:38:10Z)
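The "Muon Optimizes Under Spectral Norm Constraints" entry above views Muon's update as steepest descent under the spectral norm, whose maximizer is the orthogonal polar factor $UV^\top$ of the momentum matrix. Here is a minimal sketch of that orthogonalization, assuming the classic cubic Newton-Schulz iteration; practical Muon implementations typically use a tuned quintic polynomial instead.

```python
import numpy as np

def muon_orthogonalize(G: np.ndarray, steps: int = 20) -> np.ndarray:
    """Approximate U V^T for G = U diag(s) V^T via Newton-Schulz."""
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius scaling puts all singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # drives every singular value toward 1
    return X

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 4))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
assert np.allclose(muon_orthogonalize(G), U @ Vt, atol=1e-3)
```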
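The NuMuon entry above imposes a nuclear-norm constraint on the update direction, but the snippet does not spell out the mechanism. Purely as an illustration of what enforcing such a constraint can mean operationally (not NuMuon's algorithm), here is the standard Euclidean projection onto a nuclear-norm ball, which projects the singular values onto an $\ell_1$-ball:

```python
import numpy as np

def project_l1_ball(v: np.ndarray, tau: float) -> np.ndarray:
    """Project a nonnegative vector (e.g. singular values) onto {x >= 0 : sum(x) <= tau}."""
    if v.sum() <= tau:
        return v
    u = np.sort(v)[::-1]                       # sort descending
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cumsum - tau)[0][-1]
    theta = (cumsum[rho] - tau) / (rho + 1.0)  # soft-threshold level
    return np.maximum(v - theta, 0.0)

def project_nuclear_ball(M: np.ndarray, tau: float) -> np.ndarray:
    """Nearest matrix in Frobenius norm with nuclear norm at most tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(project_l1_ball(s, tau)) @ Vt

rng = np.random.default_rng(2)
P = project_nuclear_ball(rng.standard_normal((6, 4)), tau=1.0)
# The projected update sits on the boundary: its singular values sum to tau.
assert np.isclose(np.linalg.svd(P, compute_uv=False).sum(), 1.0)
```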
This list is automatically generated from the titles and abstracts of the papers on this site.