HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
- URL: http://arxiv.org/abs/2601.00583v1
- Date: Fri, 02 Jan 2026 05:56:11 GMT
- Title: HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts
- Authors: Zihan Fang, Zheng Lin, Senkang Hu, Yanan Ma, Yihang Tao, Yiqin Deng, Xianhao Chen, Yuguang Fang
- Abstract summary: We propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client. HFedMoE identifies expert importance based on each expert's contribution to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client's computing budget.
- Score: 26.55877320740609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While federated learning (FL) enables fine-tuning of large language models (LLMs) without compromising data privacy, the substantial size of an LLM renders on-device training impractical for resource-constrained clients, such as mobile devices. Thus, Mixture-of-Experts (MoE) models have emerged as a computation-efficient solution, which activates only a sparse subset of experts during model training to reduce computing burden without sacrificing performance. Though integrating MoE into FL fine-tuning holds significant potential, it still encounters three key challenges: i) selecting appropriate experts for clients remains challenging due to the lack of a reliable metric to measure each expert's impact on local fine-tuning performance, ii) the heterogeneous computing resources across clients severely hinder MoE-based LLM fine-tuning, as dynamic expert activations across diverse input samples can overwhelm resource-constrained devices, and iii) client-specific expert subsets and routing preferences undermine global aggregation, where misaligned expert updates and inconsistent gating networks introduce destructive interference. To address these challenges, we propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client for computation-efficient LLM fine-tuning. Specifically, HFedMoE identifies expert importance based on each expert's contribution to fine-tuning performance, and then adaptively selects a subset of experts from an information bottleneck perspective to align with each client's computing budget. A sparsity-aware model aggregation strategy is also designed to aggregate the actively fine-tuned experts and gating parameters with importance-weighted contributions. Extensive experiments demonstrate that HFedMoE outperforms state-of-the-art benchmarks in training accuracy and convergence speed.
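The sparsity-aware aggregation the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `aggregate_experts` function, the dict layout, and the scalar importance weights are all illustrative choices, and the paper's actual scheme additionally aggregates gating parameters.

```python
def aggregate_experts(client_updates):
    """Importance-weighted averaging of sparsely fine-tuned experts.

    client_updates: list of per-client dicts mapping
        expert_id -> (param_vector, importance_weight).
    Each client uploads only the experts it actively fine-tuned;
    every expert is averaged over just the clients that trained it.
    """
    totals, weights = {}, {}
    for update in client_updates:
        for eid, (params, w) in update.items():
            if eid not in totals:
                totals[eid] = [0.0] * len(params)
                weights[eid] = 0.0
            for i, p in enumerate(params):   # accumulate weighted params
                totals[eid][i] += w * p
            weights[eid] += w
    return {eid: [v / weights[eid] for v in totals[eid]] for eid in totals}

# Two clients with overlapping, client-specific expert subsets:
# client A trained experts {0, 1}; client B trained only expert 1.
updates = [
    {0: ([1.0, 2.0], 1.0), 1: ([0.0, 4.0], 3.0)},
    {1: ([4.0, 0.0], 1.0)},
]
agg = aggregate_experts(updates)
# Expert 0 comes from one client; expert 1 is the 3:1 weighted mean.
```

Because each expert is normalized only over the clients that actually uploaded it, clients holding different expert subsets no longer dilute one another's updates, which is the interference problem challenge iii) points at.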
Related papers
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes. In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID. We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z) - FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment [38.27527504479237]
Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation. Our approach introduces client-expert fitness scores that quantify each expert's suitability for local datasets through training feedback. Our comprehensive experiments on three different datasets demonstrate the superior performance of the proposed FLEX-MoE.
arXiv Detail & Related papers (2025-12-28T20:32:13Z) - Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices [41.84571097603175]
Federated fine-tuning of large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. We propose FLUX, a system designed to enable fine-tuning of MoE-based LLMs across participants with constrained computing resources. FLUX significantly outperforms existing methods, achieving up to 4.75X speedup in time-to-accuracy.
arXiv Detail & Related papers (2025-08-26T14:39:00Z) - Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation [56.36237936346563]
Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data. We introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.
arXiv Detail & Related papers (2025-08-22T17:47:02Z) - Efficient Training of Large-Scale AI Models Through Federated Mixture-of-Experts: A System-Level Approach [52.79991638077892]
This article highlights a critical, yet underexplored concept: the absence of robust quantitative strategies for dynamic client-expert alignment. We propose a conceptual system design for intelligent client-expert alignment that incorporates dynamic fitness scoring, global expert load monitoring, and client capacity profiling.
arXiv Detail & Related papers (2025-07-08T05:30:37Z) - FLAME: Towards Federated Fine-Tuning Large Language Models Through Adaptive SMoE [21.860699562235776]
FLAME is a novel federated learning framework based on the Sparse Mixture-of-Experts (SMoE) architecture. It retains full (uncompressed) global LoRA matrices and achieves client-side adaptability by varying the number of activated experts per client. It tackles these challenges through a lightweight rescaling mechanism and an activation-aware aggregation scheme.
arXiv Detail & Related papers (2025-06-19T21:02:19Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [86.76714527437383]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
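The sigmoid-plus-straight-through routing this summary mentions can be illustrated with a small NumPy sketch. This is a hedged reconstruction of the general technique, not DSMoE's code: `st_sigmoid_gate` and its threshold are assumptions made for the example, and the straight-through backward pass appears only as a comment because NumPy has no autodiff.

```python
import numpy as np

def st_sigmoid_gate(logits, threshold=0.5):
    """Hard 0/1 expert-selection mask from per-expert sigmoid gate scores.

    The forward pass applies a hard threshold; in an autodiff framework,
    the straight-through estimator would reuse the sigmoid's gradient via
        hard = probs + stop_gradient(hard - probs)
    so routing stays discrete forward but differentiable backward.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))          # soft scores in (0, 1)
    hard = (probs > threshold).astype(np.float64)  # discrete routing mask
    return hard, probs

# One token's gate logits over three computational blocks ("experts"):
hard, probs = st_sigmoid_gate(np.array([-2.0, 0.0, 3.0]))
# Only the third block exceeds the 0.5 threshold and is activated.
```

Because each expert gets an independent sigmoid score rather than competing in a softmax, the number of activated blocks can vary per token, which is what lets the approach trade computation for quality.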
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - FedMHO: Heterogeneous One-Shot Federated Learning Towards Resource-Constrained Edge Devices [12.08958206272527]
Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. One-shot FL has emerged as a promising approach to mitigate communication overhead, and model-heterogeneous FL solves the problem of diverse computing resources across clients. We propose a novel FL framework named FedMHO, which leverages deep classification models on resource-sufficient clients and lightweight generative models on resource-constrained devices.
arXiv Detail & Related papers (2025-02-12T15:54:56Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.