Related papers: Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

URL: http://arxiv.org/abs/2509.11983v1
Date: Mon, 15 Sep 2025 14:28:53 GMT
Title: Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
Authors: Chuan He, Zhanwang Deng, Zhaosong Lu,
Abstract summary: Recently, the Muon citejordanmuon has gained significant attention for its strong performance in foundation model training.<n>We propose low-rank matrix-signed gradient descent and a low-rank variant of Muon.
Score: 3.1922198632169327
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon \cite{jordanmuon}, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose {\it low-rank orthogonalization}, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with the low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining -- surpassing the performance of the carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of the low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.

Related papers

Muon in Associative Memory Learning: Training Dynamics and Scaling Laws [23.350512542598803]
We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs.<n>We show that Muon mitigates this imbalance, leading to faster and more uniform progress.
arXiv Detail & Related papers (2026-02-05T14:49:40Z)
Preconditioning Benefits of Spectral Orthogonalization in Muon [50.62925024212989]
We study the effectiveness of a simplified variant of Muon in two case studies: matrix factorization and in-context learning of linear transformers.<n>Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior.
arXiv Detail & Related papers (2026-01-20T00:08:31Z)
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs [20.832302616074966]
DeMuon is a method for decentralized matrix optimization over a given communication topology.<n>We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity.
arXiv Detail & Related papers (2025-10-01T19:06:11Z)
AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates [5.049533819651459]
We propose a new adaptive update, AdaGO, which combines a norm-based update with aGrad-type step.<n>AdaGO preserves the orthogonality of the update, which can be interpreted as a spectral descent, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradients.
arXiv Detail & Related papers (2025-09-03T03:42:22Z)
A Riemannian Optimization Perspective of the Gauss-Newton Method for Feedforward Neural Networks [3.48097307252416]
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions.<n>We show that the Levenberg-Marquardt dynamics with an appropriately chosen damping schedule yields fast convergence rate despite potentially ill-conditioned neural tangent kernel matrices.
arXiv Detail & Related papers (2024-12-18T16:51:47Z)
Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity [59.75300530380427]
We consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm is only accessible to noisy evaluations of the objective function it queries. We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds.
arXiv Detail & Related papers (2024-06-28T02:56:22Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
On Learning Gaussian Multi-index Models with Gradient Flow [57.170617397894404]
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection.
arXiv Detail & Related papers (2023-10-30T17:55:28Z)
Semi-Supervised Laplace Learning on Stiefel Manifolds [48.3427853588646]
We develop the framework Sequential Subspace for graph-based, supervised samples at low-label rates. We achieves that our methods at extremely low rates, and high label rates.
arXiv Detail & Related papers (2023-07-31T20:19:36Z)
Multi-View Spectral Clustering Tailored Tensor Low-Rank Representation [105.33409035876691]
This paper explores the problem of multi-view spectral clustering (MVSC) based on tensor low-rank modeling. We design a novel structured tensor low-rank norm tailored to MVSC. We show that the proposed method outperforms state-of-the-art methods to a significant extent.
arXiv Detail & Related papers (2020-04-30T11:52:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.