DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
- URL: http://arxiv.org/abs/2510.01377v1
- Date: Wed, 01 Oct 2025 19:06:11 GMT
- Title: DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
- Authors: Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson
- Abstract summary: DeMuon is a method for decentralized matrix optimization over a given communication topology. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity.
- Score: 20.832302616074966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations (a technique inherited from its centralized predecessor, Muon) and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
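The matrix orthogonalization step at the heart of Muon (and thus DeMuon) can be sketched with the classic cubic Newton-Schulz iteration, which drives a normalized matrix toward its orthogonal polar factor without computing an SVD. The snippet below is an illustrative sketch only: the cubic coefficients, normalization, and step count are generic textbook choices, not the tuned quintic variant used in Muon's actual implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the orthogonal polar factor U V^T of G via the
    cubic Newton-Schulz iteration (illustrative sketch; Muon itself
    uses a tuned quintic polynomial rather than this cubic one)."""
    # Frobenius normalization puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-8)
    for _ in range(steps):
        # Each step pushes every singular value toward 1 while
        # leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

After enough iterations, `X @ X.T` is close to the identity, so `X` acts as the orthogonalized update direction that Muon-style methods apply in place of the raw gradient.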
Related papers
- Muon in Associative Memory Learning: Training Dynamics and Scaling Laws [23.350512542598803]
We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs. We show that Muon mitigates this imbalance, leading to faster and more uniform progress.
arXiv Detail & Related papers (2026-02-05T14:49:40Z) - MuonBP: Faster Muon via Block-Periodic Orthogonalization [24.232069944820513]
We show how to adjust the learning rate from the baseline Muon to MuonBP and give guarantees for this algorithm. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer-state sharding, MuonBP achieves an 8% improvement over Muon with no degradation in performance.
arXiv Detail & Related papers (2025-10-19T19:56:05Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - On Provable Benefits of Muon in Federated Learning [23.850171320924574]
The recently introduced Muon optimizer has gained increasing attention due to its superior performance across a wide range of applications. This paper investigates the performance of Muon in the unexplored setting of federated learning.
arXiv Detail & Related papers (2025-10-04T16:27:09Z) - Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based method with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth setting and the more general $(L_0, L_1)$-smooth setting, matching the best-known Euclidean rates and enabling faster convergence under suitable norm choices.
arXiv Detail & Related papers (2025-10-01T08:20:08Z) - Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training [3.1922198632169327]
Recently, the Muon optimizer has gained significant attention for its strong performance in foundation model training. We propose low-rank matrix-signed gradient descent and a low-rank variant of Muon.
arXiv Detail & Related papers (2025-09-15T14:28:53Z) - Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized federated learning (DFL) eliminates reliance on the client-server architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z) - Data-heterogeneity-aware Mixing for Decentralized Learning [63.83913592085953]
We characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes.
We propose a metric that quantifies the ability of a graph to mix the current gradients.
Motivated by our analysis, we propose an approach that periodically and efficiently optimizes this metric.
arXiv Detail & Related papers (2022-04-13T15:54:35Z) - Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning [102.26119328920547]
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients.
We propose a general algorithmic framework, Mime, which mitigates client drift and adapts arbitrary centralized optimization algorithms.
arXiv Detail & Related papers (2020-08-08T21:55:07Z) - Quantized Decentralized Stochastic Learning over Directed Graphs [54.005946490293496]
We consider a decentralized learning problem where data points are distributed among computing nodes communicating over a directed graph. As the model size gets large, decentralized learning faces a major bottleneck: the communication load due to each node transmitting messages (model updates) to its neighbors. We propose a quantized decentralized learning algorithm over directed graphs based on the push-sum algorithm in decentralized consensus optimization.
arXiv Detail & Related papers (2020-02-23T18:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.