HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
- URL: http://arxiv.org/abs/2601.00583v1
- Date: Fri, 02 Jan 2026 05:56:11 GMT
- Title: HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
- Authors: Zihan Fang, Zheng Lin, Senkang Hu, Yanan Ma, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
- Abstract summary: We propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client. HFedMoE identifies expert importance based on each expert's contribution to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client's computing budget.
- Score: 26.55877320740609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While federated learning (FL) enables fine-tuning of large language models (LLMs) without compromising data privacy, the substantial size of an LLM renders on-device training impractical for resource-constrained clients, such as mobile devices. Thus, Mixture-of-Experts (MoE) models have emerged as a computation-efficient solution, which activates only a sparse subset of experts during model training to reduce computing burden without sacrificing performance. Though integrating MoE into FL fine-tuning holds significant potential, it still encounters three key challenges: i) selecting appropriate experts for clients remains challenging due to the lack of a reliable metric to measure each expert's impact on local fine-tuning performance, ii) the heterogeneous computing resources across clients severely hinder MoE-based LLM fine-tuning, as dynamic expert activations across diverse input samples can overwhelm resource-constrained devices, and iii) client-specific expert subsets and routing preferences undermine global aggregation, where misaligned expert updates and inconsistent gating networks introduce destructive interference. To address these challenges, we propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client for computation-efficient LLM fine-tuning. Specifically, HFedMoE identifies expert importance based on each expert's contribution to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client's computing budget. A sparsity-aware model aggregation strategy is also designed to aggregate the actively fine-tuned experts and gating parameters with importance-weighted contributions. Extensive experiments demonstrate that HFedMoE outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
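The sparsity-aware aggregation the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `aggregate_experts` function, the dict layout, and the scalar importance weights are all illustrative choices, and the paper's actual scheme additionally aggregates gating parameters.

```python
def aggregate_experts(client_updates):
    """Importance-weighted averaging of sparsely fine-tuned experts.

    client_updates: list of per-client dicts mapping
        expert_id -> (param_vector, importance_weight).
    Each client uploads only the experts it actively fine-tuned;
    every expert is averaged over just the clients that trained it.
    """
    totals, weights = {}, {}
    for update in client_updates:
        for eid, (params, w) in update.items():
            if eid not in totals:
                totals[eid] = [0.0] * len(params)
                weights[eid] = 0.0
            for i, p in enumerate(params):   # accumulate weighted params
                totals[eid][i] += w * p
            weights[eid] += w
    return {eid: [v / weights[eid] for v in totals[eid]] for eid in totals}

# Two clients with overlapping, client-specific expert subsets:
# client A trained experts {0, 1}; client B trained only expert 1.
updates = [
    {0: ([1.0, 2.0], 1.0), 1: ([0.0, 4.0], 3.0)},
    {1: ([4.0, 0.0], 1.0)},
]
agg = aggregate_experts(updates)
# Expert 0 comes from one client; expert 1 is the 3:1 weighted mean.
```

Because each expert is normalized only over the clients that actually uploaded it, clients holding different expert subsets no longer dilute one another's updates, which is the interference problem challenge iii) points at.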
Related papers
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes. In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID. We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z) - FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment [38.27527504479237]
Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation. Our approach introduces client-expert fitness scores that quantify each expert's suitability for local datasets through training feedback. Our comprehensive experiments on three different datasets demonstrate the superior performance of the proposed FLEX-MoE.
arXiv Detail & Related papers (2025-12-28T20:32:13Z) - Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices [41.84571097603175]
Federated fine-tuning of large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. We propose FLUX, a system designed to enable fine-tuning of MoE-based LLMs across participants with constrained computing resources. FLUX significantly outperforms existing methods, achieving up to 4.75X speedup in time-to-accuracy.
arXiv Detail & Related papers (2025-08-26T14:39:00Z) - Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation [56.36237936346563]
Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data. We introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.
arXiv Detail & Related papers (2025-08-22T17:47:02Z) - Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach [52.79991638077892]
This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment. We propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling.
arXiv Detail & Related papers (2025-07-08T05:30:37Z) - FLAME: Towards Federated Fine-Tuning Large Language Models Through Adaptive SMoE [21.860699562235776]
FLAME is a novel federated learning framework based on the Sparse Mixture-of-Experts (SMoE) architecture. It retains full (uncompressed) global LoRA matrices and achieves client-side adaptability by varying the number of activated experts per client. It tackles these challenges through a lightweight rescaling mechanism and an activation-aware aggregation scheme.
arXiv Detail & Related papers (2025-06-19T21:02:19Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
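The sigmoid-plus-straight-through routing this summary mentions can be illustrated with a small NumPy sketch. This is a hedged reconstruction of the general technique, not DSMoE's code: `st_sigmoid_gate` and its threshold are assumptions made for the example, and the straight-through backward pass appears only as a comment because NumPy has no autodiff.

```python
import numpy as np

def st_sigmoid_gate(logits, threshold=0.5):
    """Hard 0/1 expert-selection mask from per-expert sigmoid gate scores.

    The forward pass applies a hard threshold; in an autodiff framework,
    the straight-through estimator would reuse the sigmoid's gradient via
        hard = probs + stop_gradient(hard - probs)
    so routing stays discrete forward but differentiable backward.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))          # soft scores in (0, 1)
    hard = (probs > threshold).astype(np.float64)  # discrete routing mask
    return hard, probs

# One token's gate logits over three computational blocks ("experts"):
hard, probs = st_sigmoid_gate(np.array([-2.0, 0.0, 3.0]))
# Only the third block exceeds the 0.5 threshold and is activated.
```

Because each expert gets an independent sigmoid score rather than competing in a softmax, the number of activated blocks can vary per token, which is what lets the approach trade computation for quality.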
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - FedMHO: Heterogeneous One-Shot Federated Learning Towards Resource-Constrained Edge Devices [12.08958206272527]
Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. One-shot FL has emerged as a promising approach to mitigate communication overhead, and model-heterogeneous FL solves the problem of diverse computing resources across clients. We propose a novel FL framework named FedMHO, which leverages deep classification models on resource-sufficient clients and lightweight generative models on resource-constrained devices.
arXiv Detail & Related papers (2025-02-12T15:54:56Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.