FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
- URL: http://arxiv.org/abs/2510.27403v1
- Date: Fri, 31 Oct 2025 11:41:16 GMT
- Title: FedMuon: Accelerating Federated Learning with Matrix Orthogonalization
- Authors: Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, Jin Liu
- Abstract summary: The core bottleneck of Federated Learning (FL) lies in the communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. We propose a novel Federated Muon (FedMuon) optimizer, which incorporates two key techniques.
- Score: 27.47081354557656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The core bottleneck of Federated Learning (FL) lies in the communication rounds. That is, achieving more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often amplifies pathological directions in the weights during local updates, deteriorating the condition number and slowing convergence. Therefore, we introduce the Muon optimizer for local updates, which applies matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in the IID setting, Local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to Local SGD and Local AdamW. However, in the non-IID setting, independent matrix orthogonalization based on the local distribution of each client induces strong client drift. Applying Muon in non-IID FL poses two significant challenges: (1) client-specific preconditioners leading to client drift; (2) momentum reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that FedMuon achieves a linear speedup convergence rate without any heterogeneity assumption, in terms of $S$, the number of participating clients per round, $K$, the number of local iterations, and $R$, the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
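The matrix orthogonalization at the heart of Muon is commonly implemented with a Newton-Schulz iteration that pushes the singular values of the momentum matrix toward 1 without an explicit SVD. Below is a minimal NumPy sketch of that step; the quintic coefficients follow public Muon implementations, while the function name and iteration count are our own illustrative choices, not taken from the FedMuon paper.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately map a momentum/gradient matrix to a nearby
    semi-orthogonal matrix via a Newton-Schulz iteration (Muon-style)."""
    x = g / (np.linalg.norm(g) + eps)  # Frobenius-normalize so the iteration is stable
    # Quintic iteration coefficients used in public Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep x short-and-fat so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```

After a few iterations the singular values land in a narrow band around 1, so the update direction treats all directions of the weight matrix more uniformly, which is the geometric property the abstract credits for faster local convergence.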
Related papers
- Asynchronous Federated Stochastic Optimization for Heterogeneous Objectives Under Arbitrary Delays [1.9766522384767224]
We propose AsynchRonous Exact Averaging (AREA). AREA communicates model residuals rather than gradient estimates, reducing exposure to gradient inversion. We are the first to obtain rates that scale with the average client update frequency rather than the minimum or maximum, indicating increased robustness to outliers.
arXiv Detail & Related papers (2024-05-16T14:22:49Z) - Achieving Linear Speedup in Asynchronous Federated Learning with
Heterogeneous Clients [30.135431295658343]
Federated learning (FL) aims to learn a common global model without exchanging or transferring the data that are stored locally at different clients.
In this paper, we propose an efficient asynchronous federated learning (AFL) framework called DeFedAvg.
DeFedAvg is the first AFL algorithm that achieves the desirable linear speedup property, which indicates its high scalability.
arXiv Detail & Related papers (2024-02-17T05:22:46Z) - DFedADMM: Dual Constraints Controlled Model Inconsistency for
Decentralized Federated Learning [52.83811558753284]
Decentralized learning (DFL) discards the central server and establishes a decentralized communication network.
Existing DFL methods still suffer from two major challenges: local inconsistency and local overfitting.
arXiv Detail & Related papers (2023-08-16T11:22:36Z) - FedSpeed: Larger Local Interval, Less Communication Round, and Higher
Generalization Accuracy [84.45004766136663]
Federated learning is an emerging distributed machine learning framework.
It suffers from the non-vanishing biases introduced by locally inconsistent optima and the severe client drift caused by local over-fitting.
We propose a novel and practical method, FedSpeed, to alleviate the negative impacts posed by these problems.
arXiv Detail & Related papers (2023-02-21T03:55:29Z) - Improving the Model Consistency of Decentralized Federated Learning [68.2795379609854]
Federated Learning (FL) discards the central server and each client only communicates with its neighbors in a decentralized communication network.
Existing DFL methods suffer from inconsistency among local clients, which results in inferior performance compared to FL with a central server.
We propose DFedSAM-MGS, where $1-\lambda$ is the spectral gap of the gossip matrix and $Q$ is the number of gossip steps.
arXiv Detail & Related papers (2023-02-08T14:37:34Z) - FedFOR: Stateless Heterogeneous Federated Learning with First-Order
Regularization [24.32029125031383]
Federated Learning (FL) seeks to distribute model training across local clients without collecting data in a centralized data-center.
We propose a first-order approximation of the global data distribution into local objectives, which intuitively penalizes updates in the opposite direction of the global update.
Our approach does not impose unrealistic limits on the client size, enabling learning from a large number of clients as is typical in most FL applications.
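The FedFOR summary above describes folding a first-order approximation of the global objective into each local objective, so that moving against the previous global update is penalized. A minimal sketch of one such regularized local step, assuming a linear penalty whose gradient is proportional to the negated global update (the function name, penalty form, and hyperparameters are illustrative, not the paper's exact rule):

```python
import numpy as np

def regularized_local_step(w, local_grad, global_update, lr=0.1, mu=0.5):
    """One local SGD step with a first-order penalty that discourages
    updates opposite to the previous global update direction."""
    # The penalty's gradient adds -mu * global_update, nudging the local
    # step to stay aligned with the global trajectory.
    g = local_grad - mu * global_update
    return w - lr * g
```

Because the penalty is first-order (a fixed vector, not a proximal term), the client stays stateless: it only needs the broadcast global update, which matches the summary's claim about scaling to many clients.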
arXiv Detail & Related papers (2022-09-21T17:57:20Z) - AdaBest: Minimizing Client Drift in Federated Learning via Adaptive Bias
Estimation [12.62716075696359]
In Federated Learning (FL), a number of clients or devices collaborate to train a model without sharing their data.
In order to estimate and therefore remove this drift, variance reduction techniques have been incorporated into FL optimization recently.
We propose an adaptive algorithm that accurately estimates drift across clients.
arXiv Detail & Related papers (2022-04-27T20:04:24Z) - FedDC: Federated Learning with Non-IID Data via Local Drift Decoupling
and Correction [48.85303253333453]
Federated learning (FL) allows multiple clients to collectively train a high-performance global model without sharing their private data.
We propose a novel federated learning algorithm with local drift decoupling and correction (FedDC).
FedDC only introduces lightweight modifications in the local training phase: each client maintains an auxiliary local drift variable to track the gap between the local model parameters and the global model parameters.
Experimental results and analysis demonstrate that FedDC yields faster convergence and better performance on various image classification tasks.
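The drift-variable mechanism described above can be illustrated with a heavily simplified sketch: each client accumulates the gap between its locally trained parameters and the broadcast global parameters. This is only the tracking idea from the summary, not FedDC's exact penalty and correction rules; the class name, loss, and hyperparameters are our illustrative assumptions.

```python
import numpy as np

class DriftTrackingClient:
    """Simplified illustration of FedDC-style drift tracking: the auxiliary
    variable h accumulates the local-vs-global parameter gap across rounds."""
    def __init__(self, dim):
        self.h = np.zeros(dim)  # auxiliary local drift variable

    def local_train(self, w_global, grad_fn, lr=0.1, steps=5):
        w = w_global.copy()
        for _ in range(steps):
            w -= lr * grad_fn(w)  # plain local SGD on the client objective
        self.h += w - w_global    # record the drift incurred this round
        return w, self.h
```

In FedDC proper, this drift variable also enters the local objective and the model the client reports, decoupling the accumulated drift from the aggregated global model.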
arXiv Detail & Related papers (2022-03-22T14:06:26Z) - Faster Non-Convex Federated Learning via Global and Local Momentum [57.52663209739171]
FedGLOMO is the first (first-order) FL algorithm with both global and local momentum.
Our algorithm is provably optimal even with compressed communication between the clients and the server.
arXiv Detail & Related papers (2020-12-07T21:05:31Z) - Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning [102.26119328920547]
Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients.
We propose a general algorithmic framework, Mime, which mitigates client drift and adapts arbitrary centralized optimization algorithms.
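Mime's core idea, per the summary above, is to make local steps mimic what the centralized optimizer would do, by reusing server-computed optimizer state unchanged during local training. A minimal sketch for momentum SGD, assuming the server momentum is held frozen within a round (names and hyperparameters are illustrative):

```python
import numpy as np

def mime_local_step(w, local_grad, server_momentum, lr=0.1, beta=0.9):
    """One Mime-style local step: apply the centralized momentum-SGD update
    rule, but with the server's momentum buffer kept frozen locally."""
    # Blend the fresh local gradient with the frozen global momentum,
    # exactly as centralized momentum SGD would combine them.
    d = (1 - beta) * local_grad + beta * server_momentum
    return w - lr * d
```

Because the optimizer state is global rather than per-client, local updates stay anchored to the global trajectory, which is how the framework mitigates client drift while accommodating arbitrary centralized optimizers.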
arXiv Detail & Related papers (2020-08-08T21:55:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.