Related papers: GRIN: GRadient-INformed MoE

URL: http://arxiv.org/abs/2409.12136v1
Date: Wed, 18 Sep 2024 17:00:20 GMT
Title: GRIN: GRadient-INformed MoE
Authors: Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Abstract summary: Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing. We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
Score: 132.87651078514122
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16$\times$3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.

Related papers

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models [20.427087561312057]
We introduce Leverage Efficiency (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. EL is driven by the expert activation ratio and the total compute budget, both following predictable power laws. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration.
arXiv Detail & Related papers (2025-07-23T17:10:23Z)
Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights [3.8192930334982074]
Fine-grained MoE approaches have demonstrated potential in improving model convergence and quality. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.
arXiv Detail & Related papers (2025-06-03T13:55:48Z)
SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs [0.0]
We present SmolTulu, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios.
arXiv Detail & Related papers (2024-12-11T12:41:36Z)
MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. We propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights. Our pruning method is one-shot, requiring no retraining or weight updates.
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference. Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance. We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs). We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more in training. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model. Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
arXiv Detail & Related papers (2023-03-11T05:38:15Z)
Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.