Dion2: A Simple Method to Shrink Matrix in Muon
- URL: http://arxiv.org/abs/2512.16928v1
- Date: Mon, 01 Dec 2025 16:58:10 GMT
- Title: Dion2: A Simple Method to Shrink Matrix in Muon
- Authors: Kwangjun Ahn, Noah Amsel, John Langford,
- Abstract summary: We introduce Dion2, a much simpler method than prior approaches for shrinking the matrix involved in Muon's iteration. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those.
- Score: 19.766325230655173
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a much simpler method than prior approaches for shrinking the matrix involved in Muon's computation. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs, which in turn improves the scalability of Muon.
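The row/column-sampling idea from the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration of the high-level mechanism only, not the paper's algorithm: the function name `dion2_style_update` is invented here, QR stands in for Muon's Newton-Schulz orthonormalization, and details such as momentum, error feedback, and scaling are omitted.

```python
import numpy as np

def dion2_style_update(grad, frac=0.25, rng=None):
    """Sketch of the core idea: orthonormalize only a sampled fraction
    of columns, leaving the remaining columns of the update zero.
    Illustrative only; the actual Dion2 algorithm differs in details."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = grad.shape
    k = max(1, int(frac * n))
    cols = rng.choice(n, size=k, replace=False)  # sample a column subset
    sub = grad[:, cols]                          # m x k submatrix
    # Orthonormalize only the submatrix (QR here; Muon uses Newton-Schulz).
    q, _ = np.linalg.qr(sub)
    update = np.zeros_like(grad)                 # update is sparse by construction
    update[:, cols] = q
    return update
```

Because only a `frac` fraction of columns is touched, the orthonormalization cost and the communicated update volume both shrink proportionally, which is the claimed source of the scalability gain.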
Related papers
- Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type training methods for large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - MuonBP: Faster Muon via Block-Periodic Orthogonalization [24.232069944820513]
We show how to adjust the learning rate from the baseline to MuonBP and give guarantees for this algorithm. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer-state sharding, MuonBP achieves an 8% throughput improvement over Muon with no degradation in performance.
arXiv Detail & Related papers (2025-10-19T19:56:05Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B-parameter pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning [1.0282274843007797]
We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties. To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems.
arXiv Detail & Related papers (2025-07-06T05:17:28Z) - Orthogonal Finetuning Made Scalable [92.34573849209238]
Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. We propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. These modifications allow OFTv2 to achieve up to 10x faster training and 3x lower GPU memory usage without compromising performance.
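The weight-centric versus input-centric distinction in the summary above comes down to the associativity of matrix products. A toy sketch of that trick, with hypothetical function names; the real OFTv2 additionally uses a structured, parameterized orthogonal factor, which is omitted here:

```python
import numpy as np

def oft_weight_centric(r, w, x):
    """Weight-centric OFT: materialize the rotated weight R @ W first
    (matrix-matrix product, cubic in the dimension)."""
    return (r @ w) @ x

def oft_input_centric(r, w, x):
    """Input-centric reformulation: R @ (W @ x) via two matrix-vector
    products (quadratic cost), never forming R @ W."""
    return r @ (w @ x)
```

Both routes produce the same output vector; only the cost of getting there differs, which is the essence of the matrix-free reformulation.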
arXiv Detail & Related papers (2025-06-24T17:59:49Z) - Iterative Orthogonalization Scaling Laws [0.0]
Muon has attracted much attention of late as a possible replacement for the seemingly omnipresent Adam optimizer. This paper characterizes the scaling behavior of the iterative orthogonalization step theoretically and empirically on random matrices, but does not suggest what to do about it.
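The iterative orthogonalization these papers analyze is typically a Newton-Schulz iteration, which drives a matrix's singular values toward 1 using only matrix multiplications. A minimal sketch of the cubic variant follows; this is a simplified illustration under assumed coefficients (1.5, -0.5), whereas Muon's production variant uses tuned quintic coefficients and fewer steps:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=10):
    """Cubic Newton-Schulz iteration: x <- 1.5*x - 0.5*x x^T x.
    Each step pushes the singular values of x toward 1 while
    preserving the singular vectors, so the limit is the nearest
    semi-orthogonal matrix (the polar factor of g)."""
    x = g / (np.linalg.norm(g) + 1e-12)  # Frobenius normalization keeps all singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

The super-linear cost referenced in the Dion2 abstract comes from the repeated matrix-matrix products here, which is why shrinking the matrix entering this step pays off at scale.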
arXiv Detail & Related papers (2025-05-06T22:34:55Z) - Dion: Distributed Orthonormalized Updates [27.66769374729482]
We introduce Dion (Distributed Orthonormalization), a scalable and efficient update rule. It replaces the Newton-Schulz iteration with amortized power iteration on a momentum buffer. A rank-fraction parameter with error feedback enables low-rank updates that balance quality against significant cost savings.
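One step of the amortized power iteration idea can be sketched as follows. This is a heavily simplified illustration in the spirit of Dion's update rule, with an invented function name; the actual algorithm also involves error feedback and distributed sharding, which are omitted:

```python
import numpy as np

def amortized_power_iter_step(momentum, q_prev):
    """One power-iteration refinement of a rank-r orthonormal basis for
    the momentum buffer, yielding a rank-r orthonormalized update.
    momentum: (m, n) buffer; q_prev: (n, r) basis from the last step."""
    p = momentum @ q_prev           # (m, r): push the basis through the buffer
    p, _ = np.linalg.qr(p)          # orthonormalize the left factor
    r_new = momentum.T @ p          # (n, r): refreshed right factor
    q_new, _ = np.linalg.qr(r_new)  # orthonormal basis carried to the next step
    update = p @ q_new.T            # rank-r update with unit singular values
    return update, q_new
```

Because the basis is refined by only one multiplication pair per optimizer step (amortizing the power iteration across steps), the per-step cost stays far below a full Newton-Schulz orthonormalization.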
arXiv Detail & Related papers (2025-04-07T17:49:37Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model advances the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It simultaneously can reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.