$\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts
- URL: http://arxiv.org/abs/2601.17680v1
- Date: Sun, 25 Jan 2026 03:55:51 GMT
- Title: $\infty$-MoE: Generalizing Mixture of Experts to Infinite Experts
- Authors: Shota Takashiro, Takeshi Kojima, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
- Abstract summary: Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. We propose $\infty$-MoE, which selects a portion of the parameters of large FFNs based on continuous values sampled for each token. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters.
- Score: 43.075289015406355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Mixture of Experts (MoE) selects a few feed-forward networks (FFNs) per token, achieving an effective trade-off between computational cost and performance. In conventional MoE, each expert is treated as entirely independent, and experts are combined in a discrete space. As a result, when the number of experts increases, it becomes difficult to train each expert effectively. To stabilize training while increasing the number of experts, we propose $\infty$-MoE that selects a portion of the parameters of large FFNs based on continuous values sampled for each token. By considering experts in a continuous space, this approach allows for an infinite number of experts while maintaining computational efficiency. Experiments show that a GPT-2 Small-based $\infty$-MoE model, with 129M active and 186M total parameters, achieves comparable performance to a dense GPT-2 Medium with 350M parameters. Adjusting the number of sampled experts at inference time allows for a flexible trade-off between accuracy and speed, with an improvement of up to 2.5\% in accuracy over conventional MoE.
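The abstract does not spell out how the continuous expert space is parameterized, so the following is only a minimal sketch of the idea as described above: each token is routed to a continuous position, a few positions are sampled around it, each sample activates a contiguous slice of one large shared FFN, and the slice outputs are averaged. The module and parameter names (`ContinuousExpertFFN`, `slice_width`, `num_samples`) and the slicing/noise scheme are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of continuous expert selection over one large shared FFN.
# ASSUMPTIONS (not from the paper): a continuous value indexes a contiguous
# slice of hidden units, and several sampled slices are averaged per token.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContinuousExpertFFN(nn.Module):
    """One large FFN whose hidden units are addressed by a continuous position."""

    def __init__(self, d_model: int, d_hidden: int, slice_width: int, num_samples: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)   # shared "infinite expert" pool, first layer
        self.w_out = nn.Linear(d_hidden, d_model)  # shared pool, second layer
        self.router = nn.Linear(d_model, 1)        # maps a token to a continuous position in (0, 1)
        self.d_hidden = d_hidden
        self.slice_width = slice_width             # active parameters per sampled expert
        self.num_samples = num_samples             # tunable at inference time (accuracy/speed trade-off)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). One sampled "expert" = one contiguous slice of hidden units.
        pos = torch.sigmoid(self.router(x))        # (batch, 1), continuous routing position
        outputs = []
        for _ in range(self.num_samples):
            # Perturb the position so that, in principle, infinitely many experts exist.
            noisy = (pos + 0.05 * torch.randn_like(pos)).clamp(0.0, 1.0)
            start = (noisy * (self.d_hidden - self.slice_width)).long()       # (batch, 1)
            idx = start + torch.arange(self.slice_width, device=x.device)     # (batch, slice_width)
            # Only the selected rows/columns of the big FFN are touched,
            # so per-token compute scales with slice_width, not d_hidden.
            w_in_rows = self.w_in.weight[idx]                                 # (batch, slice_width, d_model)
            b_in = self.w_in.bias[idx]                                        # (batch, slice_width)
            hidden = F.relu(torch.einsum("bd,bsd->bs", x, w_in_rows) + b_in)
            w_out_cols = self.w_out.weight.t()[idx]                           # (batch, slice_width, d_model)
            outputs.append(torch.einsum("bs,bsd->bd", hidden, w_out_cols) + self.w_out.bias)
        # Average the sampled experts; more samples = more compute per token.
        return torch.stack(outputs, dim=0).mean(dim=0)
```

Raising or lowering `num_samples` at inference time would give the accuracy/speed trade-off mentioned in the abstract; the specific noise scale and slicing rule here are placeholders.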
Related papers
- FlexMoRE: A Flexible Mixture of Rank-heterogeneous Experts for Efficient Federatedly-trained Large Language Models [3.852094291611636]
We introduce FlexMoRE, a flexible mixture of rank-heterogeneous experts. We show that the best-performing rank is substantially higher for reasoning-heavy benchmarks than for knowledge-heavy benchmarks.
arXiv Detail & Related papers (2026-02-09T15:54:29Z) - Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts [43.63398524449102]
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. We introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead.
arXiv Detail & Related papers (2025-09-26T05:29:19Z) - Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
This study investigates domain specialization and expert redundancy in large-scale MoE models. We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method achieves performance comparable to the full model and $2.99\times$ higher throughput under the same memory budget while keeping only half the experts.
arXiv Detail & Related papers (2025-04-09T11:34:06Z) - Unified Sparse Mixture of Experts [14.774596844618396]
Sparse Mixture of Experts (SMoE) models scale the capacity of models while maintaining constant computational overhead. This paper proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses these limitations.
arXiv Detail & Related papers (2025-03-29T07:15:12Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and then dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating. (A standard top-k softmax gate of this kind is sketched in the code after this list.)
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z) - Beyond Parameter Count: Implicit Bias in Soft Mixture of Experts [44.09546603624385]
We introduce a notion of expert specialization for Soft MoE.
We show that when there are many small experts, the architecture is implicitly biased in a fashion that allows us to efficiently approximate the specialized expert subset.
arXiv Detail & Related papers (2024-09-02T00:39:00Z) - HMoE: Heterogeneous Mixture of Experts for Language Modeling [45.65121689677227]
Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity.
We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities.
HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
arXiv Detail & Related papers (2024-08-20T09:35:24Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
arXiv Detail & Related papers (2023-10-15T13:28:42Z)
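For contrast with the continuous selection sketched earlier, the sketch below shows the conventional discrete routing that several of the papers above build on, such as the softmax-gating analysis: a top-k softmax gate scores every expert per token, keeps the k highest-scoring experts, and mixes their outputs. This is a standard textbook formulation, not code from any listed paper; names such as `TopKSoftmaxGate` and `SparseMoELayer` are illustrative.

```python
# Hedged sketch of a conventional sparse MoE layer with top-k softmax gating.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSoftmaxGate(nn.Module):
    """Score every expert per token, keep the k best, renormalize their weights."""

    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # per-token relevance logits
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                              # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # discrete selection of k experts
        weights = F.softmax(topk_vals, dim=-1)             # softmax over the selected experts only
        return topk_idx, weights


class SparseMoELayer(nn.Module):
    """Conventional MoE: independent FFN experts combined by a top-k softmax gate."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.gate = TopKSoftmaxGate(d_model, num_experts, k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        idx, w = self.gate(x)                 # both (batch, k)
        out = torch.zeros_like(x)
        for slot in range(idx.shape[-1]):     # loop over the k selected slots
            for e in idx[:, slot].unique():   # dispatch tokens to each chosen expert
                mask = idx[:, slot] == e
                out[mask] += w[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out
```

In this discrete setting the only inference-time knob is k, whereas the abstract above reports varying the number of sampled experts in the continuous setting.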
This list is automatically generated from the titles and abstracts of the papers on this site.