Llama 3 Meets MoE: Efficient Upcycling
- URL: http://arxiv.org/abs/2412.09952v1
- Date: Fri, 13 Dec 2024 08:22:19 GMT
- Title: Llama 3 Meets MoE: Efficient Upcycling
- Authors: Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal,
- Abstract summary: We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute.
Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU.
We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
- Score: 1.8337958765930928
- Abstract: Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
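As an illustration of what upcycling a dense checkpoint into a Top-2 MoE layer can look like, the sketch below copies a pre-trained FFN into each of 8 experts and adds a freshly initialized router. This is a minimal, hypothetical PyTorch rendering, not the paper's actual recipe (which runs inside Megatron/NeMo with expert parallelism, load balancing, and online upcycling); the class and argument names are illustrative only.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledTop2MoE(nn.Module):
    """Hypothetical sketch: turn one pre-trained dense FFN into an
    8-expert, Top-2 MoE layer by copying its weights into every expert."""

    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router is new and randomly initialized; the experts start as
        # clones of the dense FFN, so the upcycled layer initially behaves
        # like the dense model it came from.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        logits = self.router(x)                               # (tokens, experts)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)                   # renormalize over top-2
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += gates[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In this sketch, replacing each transformer block's MLP with `UpcycledTop2MoE(block.mlp, hidden_size)` reuses the Llama 3-8B weights and only adds router parameters; continued training on a small compute budget then lets the experts diverge.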
Related papers
- 2 OLMo 2 Furious [126.72656187302502]
OLMo 2 includes dense autoregressive models with improved architecture and training recipe.
Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124.
Our fully open OLMo 2-Instruct models are competitive with, or surpass, open-weight-only models of comparable size.
arXiv Detail & Related papers (2024-12-31T21:55:10Z)
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
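To make the "sparse gradient estimation for expert routing" mentioned in the GRIN entry above concrete, here is a generic straight-through-style Top-1 gate: hard expert selection in the forward pass, softmax gradients in the backward pass. This is a simplified stand-in for illustration only, not GRIN's actual estimator.

```python
import torch
import torch.nn.functional as F

def straight_through_top1(router_logits: torch.Tensor):
    """Illustrative straight-through Top-1 gating (not GRIN's exact method).

    Forward: hard one-hot assignment to the best-scoring expert.
    Backward: gradients flow through the dense softmax, so the router still
    receives a learning signal despite the non-differentiable argmax.
    """
    probs = F.softmax(router_logits, dim=-1)                  # differentiable surrogate
    expert_idx = probs.argmax(dim=-1)                         # discrete routing decision
    hard = F.one_hot(expert_idx, router_logits.size(-1)).to(probs.dtype)
    gates = hard + probs - probs.detach()                     # hard forward, soft backward
    return gates, expert_idx
```

Multiplying each expert's output by its entry in `gates` keeps computation sparse at the selected expert while still producing a routing gradient for the router weights.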
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies [36.645912291368546]
We present AquilaMoE, a cutting-edge bilingual 8*16B Mixture-of-Experts (MoE) language model with 8 experts of 16 billion parameters each.
This approach optimizes performance while minimizing data requirements through a two-stage process.
We successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
arXiv Detail & Related papers (2024-08-13T02:07:00Z)
- LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z)
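As a rough sketch of what a layer-wise distillation signal for the dense-to-MoE conversion in the LaDiMo entry above could look like (hypothetical; the abstract does not spell out LaDiMo's exact objective), each converted MoE layer can be trained to reproduce the output of the frozen dense FFN it replaced:

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(dense_layers, moe_layers, hidden_states):
    """Hypothetical per-layer distillation loss for dense-to-MoE conversion.

    dense_layers, moe_layers: matching lists of the original FFN and the
        replacement MoE module for each transformer block.
    hidden_states: list of layer inputs captured from a forward pass,
        one tensor of shape (tokens, hidden) per block.
    """
    loss = hidden_states[0].new_zeros(())
    for dense_ffn, moe_ffn, h in zip(dense_layers, moe_layers, hidden_states):
        with torch.no_grad():                  # teacher: the frozen dense FFN
            target = dense_ffn(h)
        loss = loss + F.mse_loss(moe_ffn(h), target)
    return loss / len(hidden_states)
```

Because each layer only has to imitate a fixed teacher on cached activations, a small token budget (the entry cites 100K tokens) can plausibly suffice for this kind of conversion.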
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training [45.97480866595295]
Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while keeping computation cost constant.
We adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range.
arXiv Detail & Related papers (2024-05-23T21:00:53Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
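A minimal way to picture the "dense training, sparse inference" idea in the DS-MoE entry above is a gate that weights all experts during training but keeps only the top-k at inference time. The sketch below is a hypothetical illustration, not the DS-MoE implementation:

```python
import torch
import torch.nn.functional as F

def dense_train_sparse_infer_gate(router_logits: torch.Tensor,
                                  training: bool, top_k: int = 2):
    """Illustrative gating for dense training / sparse inference.

    Training: every expert receives a nonzero softmax weight, so all experts
    are computed and updated (dense computation).
    Inference: only the top-k experts keep their weights; the rest are zeroed,
    so only k experts need to be evaluated per token (sparse computation).
    """
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    if training:
        return probs                                       # dense: all experts active
    top_vals, top_idx = torch.topk(probs, top_k, dim=-1)
    sparse = torch.zeros_like(probs)
    sparse.scatter_(-1, top_idx, top_vals)                 # keep only top-k weights
    return sparse
```

Skipping experts whose gate weight is zero at inference is what yields the parameter-efficient, sparse serving path while training still behaves like a dense model.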
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)