FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models
- URL: http://arxiv.org/abs/2602.08818v1
- Date: Mon, 09 Feb 2026 15:54:29 GMT
- Title: FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models
- Authors: Annemette Brok Pirchert, Jacob Nielsen, Mogens Henrik From, Lukas Galke Poech, Peter Schneider-Kamp
- Abstract summary: We introduce FlexMoRE, a flexible mixture of rank-heterogeneous experts. We show that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks.
- Score: 3.852094291611636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in mixture-of-experts architectures have shown that individual expert models can be trained federatedly, i.e., in isolation from other experts, by using a common base model to facilitate coordination. However, we hypothesize that full-sized experts may not be necessary for all domains and that low-rank adapters may instead be sufficient. Here, we introduce FlexMoRE, a Flexible Mixture of Rank-heterogeneous Experts, which may be either full-sized experts or adapters of a suitable rank. We systematically investigate the trade-off between expert rank and downstream task performance by evaluating $6$ experts with ranks $2^0$ to $2^{14}$, resulting in experiments covering 150 mixtures (96 with 2 experts, 54 with 7 experts) that are evaluated across $120$ tasks. For our experiments, we build on FlexOlmo and turn its pre-trained experts into low-rank versions. Our regression analysis from expert rank to downstream task performance reveals that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks. These findings on rank sensitivity come with direct implications for memory efficiency: using optimal ranks, FlexMoRE yields improved downstream task performance (average score $47.18$) compared to the baseline FlexOlmo-style mixture of full-sized experts (average score $45.46$) at less than one third of the parameters ($10.75$B for FlexMoRE vs. $33.27$B for FlexOlmo). All code will be made available.
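The central mechanism, a mixture whose experts over a shared base model can be either full-sized FFNs or low-rank adapters, can be illustrated with a minimal PyTorch sketch. Class and parameter names (RankHeterogeneousMoE, LowRankExpert, expert_ranks) are illustrative assumptions, not the FlexMoRE codebase, and routing is simplified to standard token-level top-k.
```python
# Minimal sketch, assuming a shared base FFN plus per-expert modules that are
# either full-sized FFNs or LoRA-style low-rank adapters. Illustrative only.
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """Expert parameterized as an up/down projection with rank r << d_model."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x):
        return self.up(self.down(x))


class FullExpert(nn.Module):
    """Full-sized expert (a complete FFN)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.ffn(x)


class RankHeterogeneousMoE(nn.Module):
    """Shared base FFN plus a routed mix of experts whose capacities differ per domain."""
    def __init__(self, d_model, d_hidden, expert_ranks, top_k=2):
        super().__init__()
        self.base_ffn = FullExpert(d_model, d_hidden)    # common base model output
        self.experts = nn.ModuleList([
            FullExpert(d_model, d_hidden) if r is None else LowRankExpert(d_model, r)
            for r in expert_ranks                        # None marks a full-sized expert
        ])
        self.router = nn.Linear(d_model, len(self.experts))
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        out = self.base_ffn(x)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)     # per-token expert choices
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: two full experts plus four low-rank adapters of increasing rank.
layer = RankHeterogeneousMoE(256, 1024, expert_ranks=[None, None, 4, 16, 64, 256])
print(layer(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```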
Related papers
- $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts [43.075289015406355]
Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. We propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters.
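One plausible reading of this mechanism is that a continuous routing value per token picks a slice of one large FFN's hidden units, so experts form a continuum over a shared parameter pool. The sketch below follows that reading; the class name ContinuousSliceFFN, the windowing scheme, and all sizes are assumptions rather than the paper's design.
```python
# Sketch: a per-token continuous value selects a contiguous window of hidden
# units in one large FFN. Window selection and sizes are assumptions.
import torch
import torch.nn as nn


class ContinuousSliceFFN(nn.Module):
    def __init__(self, d_model=256, d_hidden=4096, active=512):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)
        self.router = nn.Linear(d_model, 1)   # one continuous value per token
        self.active = active                  # hidden units active per token

    def forward(self, x):                     # x: (tokens, d_model)
        h = torch.relu(self.w_in(x))          # (tokens, d_hidden)
        c = torch.sigmoid(self.router(x))     # continuous "expert position" in [0, 1]
        start = (c * (h.size(1) - self.active)).long()             # (tokens, 1)
        cols = start + torch.arange(self.active, device=x.device)  # (tokens, active)
        mask = torch.zeros_like(h).scatter_(1, cols, 1.0)
        return self.w_out(h * mask)           # only the selected slice contributes


ffn = ContinuousSliceFFN()
print(ffn(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```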
arXiv Detail & Related papers (2026-01-25T03:55:51Z) - Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts [43.63398524449102]
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. We introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead.
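The inference-time scalability amounts to exposing the number of activated experts as a runtime argument rather than a fixed hyperparameter. A minimal sketch of such an interface follows; it shows only the elastic top-k routing, not EMoE's training recipe, and the names (ElasticTopKMoE, k) are illustrative.
```python
# Minimal sketch of a top-k MoE layer where k is chosen per forward call,
# so the expert budget can be raised or lowered at inference time.
import torch
import torch.nn as nn


class ElasticTopKMoE(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, n_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x, k=2):                          # k is chosen per call
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the k picked
        out = torch.zeros_like(x)
        for slot in range(k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


moe = ElasticTopKMoE()
x = torch.randn(4, 256)
print(moe(x, k=2).shape, moe(x, k=6).shape)  # same layer, different expert budgets
```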
arXiv Detail & Related papers (2025-09-26T05:29:19Z) - Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs [54.95810313530111]
DERN is a task-agnostic and retraining-free framework for expert pruning and reconstruction. It improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity.
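A rough, heavily simplified sketch of the "drop experts, recombine neurons" idea is given below: the least-used experts are removed and each of their hidden neurons is reassigned to the most similar retained expert. The usage-based importance score and cosine-similarity reassignment rule are assumptions for illustration, not DERN's actual criteria.
```python
# Sketch of retraining-free expert pruning with neuron recombination,
# under the assumptions stated above.
import torch


def prune_and_recombine(w_in, w_out, usage, keep):
    """w_in[e]: (H, d) input weights, w_out[e]: (d, H) output weights, usage: (E,) routing counts."""
    order = usage.argsort(descending=True)
    kept, dropped = order[:keep].tolist(), order[keep:].tolist()
    new_in = {e: [w_in[e]] for e in kept}
    new_out = {e: [w_out[e]] for e in kept}
    for e in dropped:                                  # reassign each neuron of a dropped expert
        for j in range(w_in[e].size(0)):
            neuron = w_in[e][j]                        # (d,) input weights of one hidden unit
            # place it with the retained expert whose mean input direction is closest
            sims = torch.stack([
                torch.cosine_similarity(neuron, w_in[k].mean(0), dim=0) for k in kept
            ])
            target = kept[sims.argmax().item()]
            new_in[target].append(neuron.unsqueeze(0))
            new_out[target].append(w_out[e][:, j:j + 1])
    return ({e: torch.cat(new_in[e], dim=0) for e in kept},
            {e: torch.cat(new_out[e], dim=1) for e in kept})


d, H, E = 16, 32, 4
w_in = [torch.randn(H, d) for _ in range(E)]
w_out = [torch.randn(d, H) for _ in range(E)]
kept_in, _ = prune_and_recombine(w_in, w_out, usage=torch.tensor([9., 1., 7., 2.]), keep=2)
print({e: tuple(m.shape) for e, m in kept_in.items()})  # retained experts grow by the moved neurons
```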
arXiv Detail & Related papers (2025-09-12T16:09:39Z) - Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging [17.490596264046435]
Sub-MoE is a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on expert weights. Our Sub-MoE significantly outperforms existing expert pruning and merging methods.
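A minimal sketch of subspace expert merging: concatenate the expert weight matrices, take a joint truncated SVD to obtain a shared left factor, express each expert by its coefficients in that subspace, and merge experts group-wise there. The function joint_svd_merge, the rank, and the grouping are placeholders, not Sub-MoE's exact procedure.
```python
# Sketch of merging experts in a shared SVD subspace; assumptions as above.
import torch


def joint_svd_merge(expert_weights, rank, groups):
    """expert_weights: list of (d_out, d_in) matrices; groups: lists of expert ids to merge."""
    stacked = torch.cat(expert_weights, dim=1)             # (d_out, E * d_in)
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    U_r = U[:, :rank]                                      # shared subspace, (d_out, rank)
    coeffs = [U_r.T @ W for W in expert_weights]           # per-expert (rank, d_in) factors
    merged = []
    for g in groups:                                       # merge experts inside each group
        merged.append(U_r @ torch.stack([coeffs[i] for i in g]).mean(0))
    return merged


experts = [torch.randn(64, 32) for _ in range(4)]
merged = joint_svd_merge(experts, rank=16, groups=[[0, 1], [2, 3]])
print([tuple(m.shape) for m in merged])  # [(64, 32), (64, 32)]
```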
arXiv Detail & Related papers (2025-06-29T14:43:50Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
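The heterogeneity comes from letting the router choose cheap "zero-computation" experts (zero, copy, and a learned constant) alongside ordinary FFN experts, so easy tokens spend little compute. The sketch below shows one way to construct such an expert pool; the helper make_moepp_experts and the specific set of cheap experts are assumptions based on this description.
```python
# Sketch of an expert pool mixing FFN experts with zero-computation experts.
import torch
import torch.nn as nn


class ZeroExpert(nn.Module):
    def forward(self, x):
        return torch.zeros_like(x)          # token contributes nothing


class CopyExpert(nn.Module):
    def forward(self, x):
        return x                            # identity shortcut


class ConstantExpert(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        return self.const.expand_as(x)      # same learned vector for every token


def make_moepp_experts(d_model, d_hidden, n_ffn):
    ffn_experts = [
        nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        for _ in range(n_ffn)
    ]
    return nn.ModuleList(ffn_experts + [ZeroExpert(), CopyExpert(), ConstantExpert(d_model)])


experts = make_moepp_experts(d_model=256, d_hidden=512, n_ffn=6)
x = torch.randn(4, 256)
print([expert(x).shape for expert in experts][-3:])  # cheap experts still return (4, 256)
```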
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
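The factorized view can be made concrete with a CP-style decomposition of the (experts x d_in x d_out) weight tensor: the forward pass contracts token activations and soft expert gates directly against three factor matrices, so individual expert matrices are never materialized and the expert count scales cheaply. The class CPFactorizedExperts below is an illustrative sketch, not the paper's implementation.
```python
# Sketch of a CP-factorized experts layer; shapes and names are illustrative.
import torch
import torch.nn as nn


class CPFactorizedExperts(nn.Module):
    def __init__(self, n_experts=32, d_in=256, d_out=256, cp_rank=64):
        super().__init__()
        self.expert_factor = nn.Parameter(torch.randn(n_experts, cp_rank) * 0.02)
        self.in_factor = nn.Parameter(torch.randn(d_in, cp_rank) * 0.02)
        self.out_factor = nn.Parameter(torch.randn(d_out, cp_rank) * 0.02)
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):                                # x: (tokens, d_in)
        gate = self.router(x).softmax(dim=-1)            # (tokens, n_experts)
        # y_t = sum_e gate_te * x_t @ W_e, with W_e = sum_r expert[e,r] in[:,r] out[:,r]^T
        mix = (gate @ self.expert_factor) * (x @ self.in_factor)   # (tokens, cp_rank)
        return mix @ self.out_factor.T                   # (tokens, d_out)


layer = CPFactorizedExperts()
print(layer(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```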
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [26.447210565680116]
We propose the DeepSeekMoE architecture towards ultimate expert specialization.
It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; and (2) isolating a few experts as shared ones that capture common knowledge across routed experts.
We show that DeepSeekMoE achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation.
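The segmentation step can be sketched directly: each of $N$ experts with hidden width $H$ is split into $m$ experts of width $H/m$, and the router activates $mK$ of them, which keeps total and active parameters roughly constant while greatly increasing the number of possible expert combinations. The helper below (build_segmented_experts) and all sizes are illustrative.
```python
# Sketch of fine-grained expert segmentation and the resulting combinatorial
# flexibility; numbers are for illustration only.
import math
import torch.nn as nn


def build_segmented_experts(d_model, d_hidden, n_experts, m):
    """Return m*N fine-grained experts, each with hidden width d_hidden // m."""
    seg_hidden = d_hidden // m
    return nn.ModuleList([
        nn.Sequential(nn.Linear(d_model, seg_hidden), nn.GELU(),
                      nn.Linear(seg_hidden, d_model))
        for _ in range(n_experts * m)
    ])


N, K, m = 16, 2, 4
coarse_combos = math.comb(N, K)           # ways to pick K of N coarse experts
fine_combos = math.comb(N * m, K * m)     # ways to pick m*K of m*N fine experts
print(coarse_combos, fine_combos)         # combinatorial flexibility grows sharply
experts = build_segmented_experts(d_model=256, d_hidden=1024, n_experts=N, m=m)
print(len(experts))                       # 64 fine-grained experts
```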
arXiv Detail & Related papers (2024-01-11T17:31:42Z) - Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
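For linear experts, merging the selected experts' weights first and applying a single map is exactly equivalent to running each expert and summing the outputs, which is the intuition behind reducing the cost to one expert. The sketch below checks this equivalence with toy linear experts; handling real FFN experts, as MEO does, is omitted.
```python
# Sketch of "merge weights, then compute once" vs. standard per-expert computation.
import torch

d, E, k = 16, 4, 2
W = torch.randn(E, d, d)                       # one weight matrix per (linear) expert
x = torch.randn(8, d)
gate = torch.randn(8, E).softmax(dim=-1)
weights, idx = gate.topk(k, dim=-1)            # per-token expert selection

# Standard MoE: k expert forward passes per token, outputs summed.
out_moe = torch.zeros_like(x)
for slot in range(k):
    out_moe += weights[:, slot, None] * torch.einsum('td,tdo->to', x, W[idx[:, slot]])

# MEO-style: merge the k selected experts' weights, then one forward pass.
merged_W = torch.einsum('ts,tsdo->tdo', weights, W[idx])   # (tokens, d, d)
out_meo = torch.einsum('td,tdo->to', x, merged_W)

print(torch.allclose(out_moe, out_meo, atol=1e-5))  # True: same result, one pass
```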
arXiv Detail & Related papers (2023-10-15T13:28:42Z) - Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [84.11508381847929]
Sparsely activated Mixture-of-Experts (SMoE) has shown promise to scale up the learning capacity of neural networks.
We propose M-SMoE, which leverages routing statistics to guide expert merging.
Our MC-SMoE achieves up to 80% memory reduction and a 20% FLOPs reduction, with virtually no loss in performance.
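The routing-statistics guidance can be sketched as frequency-weighted merging: within a group of experts to be collapsed, their weights are combined with coefficients proportional to how often the router selected each expert. The function merge_group and the grouping below are simplified placeholders, not M-SMoE's full procedure.
```python
# Sketch of usage-weighted expert merging; grouping and normalization simplified.
import torch


def merge_group(expert_weights, routing_counts, group):
    """Merge experts listed in `group` into one weight matrix, weighted by usage."""
    counts = routing_counts[group].float()
    coeffs = counts / counts.sum()                       # usage-proportional weights
    return sum(c * expert_weights[i] for c, i in zip(coeffs, group))


E, d = 8, 16
expert_weights = [torch.randn(d, d) for _ in range(E)]
routing_counts = torch.tensor([120, 10, 95, 3, 300, 40, 7, 60])
merged = merge_group(expert_weights, routing_counts, group=[0, 2, 4])
print(merged.shape)  # torch.Size([16, 16]); three experts collapsed into one
```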
arXiv Detail & Related papers (2023-10-02T16:51:32Z)