FedMuon: Federated Learning with Bias-corrected LMO-based Optimization
- URL: http://arxiv.org/abs/2509.26337v1
- Date: Tue, 30 Sep 2025 14:45:12 GMT
- Title: FedMuon: Federated Learning with Bias-corrected LMO-based Optimization
- Authors: Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, Sebastian U. Stich
- Abstract summary: We study how Muon can be utilized in federated learning. We demonstrate that FedMuon can outperform the state-of-the-art federated learning methods.
- Score: 36.00641661700195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a new optimization method based on the linear minimization oracle (LMO), called Muon, has been attracting increasing attention since it can train neural networks faster than existing adaptive optimization methods, such as Adam. In this paper, we study how Muon can be utilized in federated learning. We first show that straightforwardly using Muon as the local optimizer of FedAvg does not converge to a stationary point because the LMO is a biased operator. We then propose FedMuon, which mitigates this issue. We also analyze how solving the LMO approximately affects the convergence rate and find that, surprisingly, FedMuon converges for any number of Newton-Schulz iterations, while it converges faster as the LMO is solved more accurately. Through experiments, we demonstrate that FedMuon outperforms state-of-the-art federated learning methods.
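The abstract turns on solving the spectral-norm LMO approximately with Newton-Schulz iterations, where the iteration count trades accuracy for compute. Below is a minimal sketch of that orthogonalization step as used in Muon-style updates, not the paper's FedMuon algorithm; the function name, coefficients, and usage line are illustrative assumptions.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, num_iters: int = 5) -> np.ndarray:
    """Approximately replace the singular values of G with ones.

    Muon-style optimizers use Newton-Schulz iterations of this kind to solve
    the spectral-norm LMO approximately; per the abstract, any iteration count
    converges, and a more accurate LMO solution converges faster.
    """
    # Normalize so all singular values lie in (0, 1], where the cubic
    # Newton-Schulz map X <- 1.5*X - 0.5*X X^T X converges to the polar factor U V^T.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

# Hypothetical local step using the orthogonalized gradient direction:
# W -= lr * newton_schulz_orthogonalize(grad_W, num_iters=5)
```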
Related papers
- To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters [16.624341041698013]
Muon has perhaps gained the highest popularity due to its superior training speed. This paper investigates the potential downsides stemming from the mechanism driving this speedup. Muon struggles to uncover common underlying structure across tasks, and is more prone to fitting spurious features.
arXiv Detail & Related papers (2026-02-28T17:37:15Z) - Muon is Provably Faster with Momentum Variance Reduction [55.388203260208485]
Recent empirical research has demonstrated that deep learning optimizers based on the linear minimization oracle (LMO) over specifically chosen non-Euclidean norms outperform Adam-type training methods on large language models.
arXiv Detail & Related papers (2025-12-18T14:38:39Z) - NorMuon: Making Muon more efficient and scalable [71.49702449498085]
We propose NorMuon (Neuron-wise Normalized Muon) as a successor to Adam. We show NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in a 1.1B-parameter pretraining setting.
arXiv Detail & Related papers (2025-10-07T01:13:41Z) - On Provable Benefits of Muon in Federated Learning [23.850171320924574]
The recently introduced optimizer Muon has gained increasing attention due to its superior performance across a wide range of applications. This paper investigates the performance of Muon in the previously unexplored setting of federated learning.
arXiv Detail & Related papers (2025-10-04T16:27:09Z) - Error Feedback for Muon and Friends [80.90330715662961]
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. Our theory covers the non-Euclidean smooth and the more general $(L_0, L_1)$-smooth settings, matching best-known Euclidean rates and enabling faster convergence under suitable norm choices.
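For context, the "EF21" in the name refers to the classical error-feedback template of that name; a hedged reminder of its update in generic notation (the combination with the LMO-based step is this paper's contribution and is not reproduced here):

```latex
% Classical EF21 error feedback with a compression operator \mathcal{C} (generic notation):
x^{t+1} = x^{t} - \gamma \, \frac{1}{n}\sum_{i=1}^{n} g_i^{t}, \qquad
g_i^{t+1} = g_i^{t} + \mathcal{C}\!\big(\nabla f_i(x^{t+1}) - g_i^{t}\big).
```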
arXiv Detail & Related papers (2025-10-01T08:20:08Z) - Muon Outperforms Adam in Tail-End Associative Memory Learning [118.98991042050532]
We show that Muon consistently achieves balanced learning across classes regardless of feature embeddings. Our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories.
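The "outer-product structure of linear associative memories" mentioned above is the classical Hebbian construction; a brief reminder in generic notation, not taken from the paper:

```latex
% A linear associative memory storing key-value pairs (u_i, v_i):
W = \sum_{i=1}^{N} v_i u_i^{\top}, \qquad
W u_j \approx v_j \ \text{when the keys } u_i \text{ are (near-)orthonormal}.
```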
arXiv Detail & Related papers (2025-09-30T10:04:08Z) - LiMuon: Light and Fast Muon Optimizer for Large Models [45.11415579822849]
We propose LiMuon, a light and fast Muon optimizer for training large models. LiMuon has a lower memory footprint than the current Muon and its variants. We prove that LiMuon achieves a sample complexity of $O(\epsilon^{-3})$ under the generalized smooth condition.
arXiv Detail & Related papers (2025-09-18T02:49:27Z) - Lions and Muons: Optimization via Stochastic Frank-Wolfe [11.287482309003334]
We show that Lion and Muon with weight decay can be viewed as special instances of a stochastic Frank-Wolfe method. We also find that convergence of the associated Frank-Wolfe gap implies convergence to a KKT point of the original problem under a norm constraint.
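The "gap" referenced above is presumably the standard Frank-Wolfe (duality) gap over the constraint set; a minimal statement in generic notation, an assumption rather than a quotation from the paper:

```latex
% Frank-Wolfe (duality) gap of f at x over the constraint set \mathcal{C}:
g(x) = \max_{s \in \mathcal{C}} \, \langle \nabla f(x), \, x - s \rangle, \qquad
g(x) \ge 0, \quad g(x) = 0 \iff x \text{ is stationary for } f \text{ on } \mathcal{C}.
```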
arXiv Detail & Related papers (2025-06-04T17:39:03Z) - Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z) - Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup
for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specific, auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z) - A Newton Frank-Wolfe Method for Constrained Self-Concordant Minimization [60.90222082871258]
We demonstrate how to scalably solve a class of constrained self-concordant minimization problems using linear minimization oracles (LMO) over the constraint set.
We prove that the number of LMO calls of our method is nearly the same as that of the Frank-Wolfe method in the L-smooth case.
arXiv Detail & Related papers (2020-02-17T15:28:31Z)