Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
- URL: http://arxiv.org/abs/2507.17702v2
- Date: Thu, 24 Jul 2025 07:27:09 GMT
- Title: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
- Authors: Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
- Abstract summary: We introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. EL is driven by the expert activation ratio and the total compute budget, both following predictable power laws. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration.
- Score: 20.427087561312057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically grounded foundation for the scaling of efficient MoE models.
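To make the abstract's quantitative claims concrete, the sketch below illustrates (a) a possible power-law functional form for Efficiency Leverage as a function of activation ratio, compute budget, and granularity, and (b) the rough compute comparison between Ling-mini-beta (0.85B active parameters) and the 6.1B dense baseline under the standard C ≈ 6·N·D approximation. The functional form, the coefficient names, and their values are illustrative assumptions, not the fitted law reported in the paper.

```python
import math

# Illustrative sketch only: the functional form and every coefficient below are
# assumptions for exposition, not the fitted scaling law from the paper.
def efficiency_leverage(activation_ratio: float, compute_budget: float, granularity: float,
                        a: float = 1.0, alpha: float = 0.5, beta: float = 0.02,
                        g_opt: float = 8.0, width: float = 1.5) -> float:
    """Hypothetical EL(A, C, G): power laws in the activation ratio A and compute
    budget C, modulated by a non-linear term in expert granularity G that peaks
    in an optimal range around g_opt (here a log-Gaussian bump)."""
    power_law = a * activation_ratio ** (-alpha) * compute_budget ** beta
    granularity_mod = math.exp(-(math.log(granularity / g_opt)) ** 2 / (2 * width ** 2))
    return power_law * granularity_mod

# Rough version of the abstract's compute comparison, using C ~= 6 * N_active * D.
tokens = 1e12                      # both models see the same 1T-token dataset
c_moe = 6 * 0.85e9 * tokens        # Ling-mini-beta: 0.85B active parameters
c_dense = 6 * 6.1e9 * tokens       # dense baseline: 6.1B parameters
print(f"dense/MoE training-compute ratio ~ {c_dense / c_moe:.1f}x")  # ~7.2x, consistent with ">7x"
```

Fitting the coefficients and the granularity term to the paper's measurements would recover the actual unified law; the sketch only shows the claimed shape of the dependence.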
Related papers
- Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources? [58.56306556151929]
Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute.
Can MoEs surpass dense architectures under strictly equal resource constraints?
We show that an MoE model with an activation rate in the optimal region can outperform its dense counterpart under the same total parameter count, training compute, and data resources.
arXiv Detail & Related papers (2025-06-13T17:59:05Z) - Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights [3.8192930334982074]
Fine-grained MoE approaches have demonstrated potential in improving model convergence and quality.
This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.
arXiv Detail & Related papers (2025-06-03T13:55:48Z) - Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.
In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.
The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z) - MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models [10.623996218106564]
Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs).
We introduce MoLAE, a novel parameterization that reformulates expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations.
We show that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities.
arXiv Detail & Related papers (2025-03-29T14:35:34Z) - Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining [59.369484219304866]
In this study, we conduct an unprecedented empirical investigation, training over 3,700 Large Language Models (LLMs) from scratch across 100 trillion tokens.
We empirically observe that, under fixed model size ($N$) and dataset size ($D$), the hyperparameter landscape exhibits convexity with a broad optimum.
Building on this insight, we formally define and empirically validate the Step Law: the optimal learning rate follows a power-law relationship with $N$ and $D$, while the optimal batch size is primarily influenced by $D$ and remains largely invariant to $N$ (a sketch of this power-law form appears after the related-papers list).
arXiv Detail & Related papers (2025-03-06T18:58:29Z) - Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient [4.34286535607654]
We present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts.
Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom.
arXiv Detail & Related papers (2025-02-07T18:55:38Z) - Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models [10.517704202614091]
Sparse Mixture-of-Experts (MoEs) allow scaling the number of parameters without proportionally increasing the FLOPs per example.
We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts the model's performance during pretraining and downstream few-shot evaluation.
arXiv Detail & Related papers (2025-01-21T18:51:15Z) - Scaling Laws for Fine-Grained Mixture of Experts [4.412803924115907]
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models.
In this work, we analyze their scaling properties, incorporating an expanded range of variables.
We establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication (a brief sketch of this idea follows the list below).
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
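For the MoEfication entry directly above, here is a minimal sketch of the splitting idea as described in the summary: the FFN's hidden neurons are partitioned into expert groups of unchanged total size, and each token passes through only a few groups. The contiguous grouping and the norm-based top-k selection used here are simplifying assumptions; the paper's method constructs the experts and a learned router more carefully, so this only shows the mechanical split.

```python
# Minimal sketch of an MoEfication-style FFN split (ReLU FFN: y = relu(x @ W1) @ W2).
# The contiguous neuron grouping and norm-based top-k selection are simplifying
# assumptions for illustration; they are not the paper's clustering or router.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 64, 8, 2
W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)

# Partition the d_ff hidden neurons into equal expert slices; total size is unchanged.
slices = np.split(np.arange(d_ff), n_experts)

def moefied_ffn(x: np.ndarray) -> np.ndarray:
    """Pass each token through only its top_k neuron groups of the original FFN."""
    h = np.maximum(x @ W1, 0.0)                       # full activations, used here only for scoring;
                                                      # the real method predicts the selection with a router
    scores = np.stack([np.linalg.norm(h[:, s], axis=1) for s in slices], axis=1)
    chosen = np.argsort(-scores, axis=1)[:, :top_k]   # top-k expert groups per token
    out = np.zeros((x.shape[0], d_model))
    for t in range(x.shape[0]):
        for e in chosen[t]:
            s = slices[e]
            out[t] += h[t, s] @ W2[s, :]              # only the selected groups contribute
    return out

x = rng.standard_normal((4, d_model))
dense_out = np.maximum(x @ W1, 0.0) @ W2
rel_err = np.linalg.norm(moefied_ffn(x) - dense_out) / np.linalg.norm(dense_out)
print(f"relative error vs. dense FFN: {rel_err:.2f}")  # large for random weights; trained models
                                                       # have sparse activations, which is what
                                                       # makes such a selection accurate
```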
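Returning to the Step Law entry in the list above, the sketch below shows the shape of such power-law hyperparameter rules: a learning rate that scales as a power law in model size $N$ and data size $D$, and a batch size driven mainly by $D$. All constants and exponents here are placeholder assumptions, not the fitted values from that paper.

```python
# Hypothetical Step-Law-style rules: optimal learning rate as a power law in
# model size N and dataset size D; optimal batch size driven mainly by D.
# All constants (c_lr, c_bs) and exponents are illustrative placeholders.

def optimal_lr(n_params: float, n_tokens: float,
               c_lr: float = 1.0, a: float = -0.3, b: float = -0.1) -> float:
    return c_lr * (n_params ** a) * (n_tokens ** b)

def optimal_batch_size(n_tokens: float,
                       c_bs: float = 0.01, g: float = 0.4) -> float:
    # Largely invariant to N by construction; depends only on D here.
    return c_bs * (n_tokens ** g)

if __name__ == "__main__":
    N, D = 1e9, 1e11  # 1B parameters, 100B tokens (example values)
    print(f"optimal learning rate ~ {optimal_lr(N, D):.2e}")
    print(f"optimal batch size    ~ {optimal_batch_size(D):.0f}")
```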