Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
- URL: http://arxiv.org/abs/2508.19078v2
- Date: Fri, 10 Oct 2025 03:56:27 GMT
- Title: Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
- Authors: Fahao Chen, Jie Wan, Peng Li, Zhou Su, Dongxiao Yu
- Abstract summary: Federated fine-tuning of large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. We propose FLUX, a system designed to enable fine-tuning of MoE-based LLMs across participants with constrained computing resources. FLUX significantly outperforms existing methods, achieving up to 4.75X speedup in time-to-accuracy.
- Score: 41.84571097603175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, it cannot achieve the desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75X speedup in time-to-accuracy.
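The third innovation above, dynamic expert role assignment via an exploration-exploitation strategy, can be read as a bandit-style selection problem: each round, pick which experts to fine-tune and which to freeze. A minimal Python sketch of that reading (the epsilon-greedy rule, the moving-average update, and all names are illustrative assumptions, not FLUX's actual algorithm):

```python
import random

def assign_expert_roles(reward_estimates, n_tuning, epsilon=0.1, rng=None):
    """Pick which experts to fine-tune this round (hypothetical rule).

    With probability epsilon, explore (random experts); otherwise exploit
    (experts with the highest estimated benefit). `reward_estimates` maps
    expert id -> running estimate of the accuracy gain from tuning it.
    """
    rng = rng or random.Random(0)
    experts = list(reward_estimates)
    if rng.random() < epsilon:
        tuning = rng.sample(experts, n_tuning)  # explore: random subset
    else:
        ranked = sorted(experts, key=reward_estimates.get, reverse=True)
        tuning = ranked[:n_tuning]              # exploit: best-known experts
    frozen = [e for e in experts if e not in tuning]
    return tuning, frozen

def update_estimate(reward_estimates, expert, observed_gain, lr=0.3):
    """Exponential moving average of the observed per-round gain."""
    old = reward_estimates[expert]
    reward_estimates[expert] = (1 - lr) * old + lr * observed_gain
```

After each round, the observed gain of every tuned expert feeds back into its estimate, so exploitation gradually concentrates on the experts that actually help.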
Related papers
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently. SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
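Finding (iii), that expert workload can be estimated from a small calibration set, amounts to counting how often each expert lands in the router's top-k over calibration tokens. A minimal sketch of that idea (the list-of-logits interface is an assumption, not the paper's implementation):

```python
from collections import Counter

def estimate_expert_load(router_logits, top_k=2):
    """Count top-k routing hits per expert over a calibration batch.

    `router_logits` is a list of per-token score lists, one score per
    expert; the returned Counter approximates each expert's workload.
    """
    load = Counter()
    for scores in router_logits:
        ranked = sorted(range(len(scores)), key=scores.__getitem__,
                        reverse=True)
        load.update(ranked[:top_k])  # top-k experts receive this token
    return load
```

A Counter makes never-selected experts read as zero load, which is exactly the underutilization signal the analysis above is after.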
arXiv Detail & Related papers (2026-02-23T15:11:16Z)
- HFedMoE: Resource-aware Heterogeneous Federated Learning with Mixture-of-Experts [26.55877320740609]
We propose HFedMoE, a heterogeneous MoE-based FL fine-tuning framework that customizes a subset of experts to each client. HFedMoE identifies expert importance based on each expert's contribution to fine-tuning performance. It then adaptively selects a subset of experts from an information bottleneck perspective to align with each client's computing budget.
arXiv Detail & Related papers (2026-01-02T05:56:11Z)
- FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment [38.27527504479237]
Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation. Our approach introduces client-expert fitness scores that quantify expert suitability for local datasets through training feedback. Our comprehensive experiments on three different datasets demonstrate the superior performance of the proposed FLEX-MoE.
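One way to read load-balanced expert assignment with fitness scores is as a greedy matching of clients to their best-fitting experts under a per-expert capacity cap. A hypothetical sketch (the fitness table, capacity rule, and greedy order are assumptions, not FLEX-MoE's exact method):

```python
def balanced_assignment(fitness, k, capacity):
    """Greedily give each client its k best-fitting experts while
    capping how many clients any one expert serves (`capacity`).

    `fitness[c][e]` scores how well expert e suits client c's data.
    """
    load = {}        # expert -> number of clients already assigned
    assignment = {}  # client -> chosen experts
    for client, scores in fitness.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        chosen = []
        for e in ranked:
            if load.get(e, 0) < capacity:  # skip overloaded experts
                chosen.append(e)
                load[e] = load.get(e, 0) + 1
            if len(chosen) == k:
                break
        assignment[client] = chosen
    return assignment
```

The capacity cap is what forces load balance: a popular expert fills up, so later clients fall back to their next-best fit instead of all piling onto the same expert.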
arXiv Detail & Related papers (2025-12-28T20:32:13Z)
- DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
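A forward-pass sketch of the sigmoid gating described above, in NumPy. This is illustrative only: the block layout is hypothetical, and the straight-through estimator concerns the backward pass, which is only noted in a comment here rather than implemented.

```python
import numpy as np

def ste_gate(router_logits, threshold=0.5):
    """Hard 0/1 gate over sigmoid routing scores.

    Forward: a block is active iff sigmoid(logit) > threshold.
    Backward (not shown): a straight-through estimator would pass the
    gradient of the soft sigmoid through the hard threshold unchanged.
    """
    soft = 1.0 / (1.0 + np.exp(-router_logits))   # sigmoid scores
    hard = (soft > threshold).astype(soft.dtype)  # binary block mask
    return hard, soft

def partitioned_ffn_forward(x, blocks, gate_mask):
    """Sum outputs of only the activated blocks (hypothetical layout:
    `blocks` is a list of weight matrices partitioning one dense FFN)."""
    out = np.zeros_like(x @ blocks[0])
    for mask, W in zip(gate_mask, blocks):
        if mask:  # skip computation for gated-off blocks
            out += x @ W
    return out
```

Because the gate is binary at inference time, gated-off blocks are simply never multiplied, which is where the computational saving comes from.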
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
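The insight above can be sketched as a cache lookup with a criticality-gated fallback: a miss on a critical expert pays the slow full-precision load, while a miss on a non-critical expert is served from a low-precision copy. All names, the criticality scores, and the threshold below are illustrative assumptions, not HOBBIT's API:

```python
def select_expert_weights(expert_id, hp_cache, lp_weights, criticality,
                          fetch_hp, tau=0.7):
    """Mixed-precision expert loading sketch (names and `tau` assumed).

    Cache hit -> full-precision weights. On a miss, only experts judged
    critical pay the slow full-precision load; the rest are substituted
    with an already-resident low-precision copy, cutting load latency.
    """
    if expert_id in hp_cache:
        return hp_cache[expert_id], "hit"
    if criticality[expert_id] >= tau:
        hp_cache[expert_id] = fetch_hp(expert_id)  # slow path: full load
        return hp_cache[expert_id], "miss-load"
    return lp_weights[expert_id], "low-precision"  # fast path: substitute
```

The trade-off is precision for latency on exactly the experts whose output matters least, which is why the substitution can be aggressive without hurting quality much.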
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
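Stage one of the SEER-MoE recipe, pruning guided by heavy-hitters counting, reduces to keeping the experts most frequently routed to during calibration. A sketch under that reading (the flat routing-trace interface is an assumption; the stage-two regularized fine-tuning is not shown):

```python
from collections import Counter

def prune_by_heavy_hitters(routing_trace, keep):
    """Stage-one sketch: keep the `keep` most frequently routed experts
    (the "heavy hitters") and prune the rest.

    `routing_trace` is a flat list of expert ids selected during a
    calibration run. Stage two (regularization-based fine-tuning to
    recover accuracy) is out of scope for this sketch.
    """
    counts = Counter(routing_trace)
    kept = [e for e, _ in counts.most_common(keep)]
    pruned = [e for e in counts if e not in kept]
    return sorted(kept), sorted(pruned)
```

Counting is cheap relative to any learned importance measure, which is what makes this a practical first pass before the accuracy-recovery stage.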
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.