Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A
Large-Scale Generative Language Model
- URL: http://arxiv.org/abs/2201.11990v2
- Date: Mon, 31 Jan 2022 05:25:13 GMT
- Title: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A
Large-Scale Generative Language Model
- Authors: Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley,
Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George
Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi,
Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston,
Saurabh Tiwary, and Bryan Catanzaro
- Abstract summary: We present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters.
We show that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results.
- Score: 35.75234515196426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained general-purpose language models can achieve state-of-the-art
accuracies in various natural language processing domains by adapting to
downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of
their success, the size of these models has increased rapidly, requiring
high-performance hardware, software, and algorithmic techniques to enable
training such large models. As a result of a joint effort between Microsoft
and NVIDIA, we present details on the training of the largest monolithic
transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530
billion parameters. In this paper, we first focus on the infrastructure as well
as the 3D parallelism methodology used to train this model using DeepSpeed and
Megatron. Next, we detail the training process, the design of our training
corpus, and our data curation techniques, which we believe are key ingredients
in the success of the model. Finally, we discuss various evaluation results, as
well as other interesting observations and new properties exhibited by MT-NLG.
We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning
accuracies on several NLP benchmarks and establishes new state-of-the-art
results. We believe that our contributions will help further the development of
large-scale training infrastructures, large-scale language models, and natural
language generation.
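The 3D parallelism mentioned in the abstract combines tensor-slicing (typically within a node), pipeline parallelism (typically across nodes), and data parallelism to scale out to thousands of GPUs. As a rough illustration only, the Python sketch below shows how such a layout partitions global ranks into the three groups; it is not the DeepSpeed or Megatron-LM code, and the parallel degrees in the example are hypothetical rather than MT-NLG's actual configuration.

    # Minimal sketch (not the DeepSpeed/Megatron implementation) of how a
    # 3D-parallel layout splits a GPU cluster into tensor-, pipeline-, and
    # data-parallel groups. The degrees below are illustrative values only.
    def three_d_layout(world_size: int, tensor_parallel: int, pipeline_parallel: int):
        """Map each global rank to its (tensor, pipeline, data) coordinates."""
        assert world_size % (tensor_parallel * pipeline_parallel) == 0, \
            "world size must be divisible by tensor_parallel * pipeline_parallel"
        data_parallel = world_size // (tensor_parallel * pipeline_parallel)
        layout = {}
        for rank in range(world_size):
            # Pack tensor-parallel ranks closest together (they exchange
            # activations most often), then pipeline stages, then replicas.
            tp = rank % tensor_parallel
            pp = (rank // tensor_parallel) % pipeline_parallel
            dp = rank // (tensor_parallel * pipeline_parallel)
            layout[rank] = {"tensor": tp, "pipeline": pp, "data": dp}
        return data_parallel, layout

    # Example: 64 GPUs with 8-way tensor parallelism and 4 pipeline stages
    # leaves 64 / (8 * 4) = 2 data-parallel replicas of the model.
    dp_degree, layout = three_d_layout(world_size=64, tensor_parallel=8, pipeline_parallel=4)
    print(dp_degree, layout[0], layout[63])

Each model replica occupies tensor_parallel * pipeline_parallel GPUs, so adding GPUs beyond that only raises the data-parallel degree.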
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs).
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
arXiv Detail & Related papers (2024-05-23T04:12:49Z)
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment language models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning; a minimal sketch of this freeze-and-project pattern appears at the end of this page.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
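The eP-ALM entry above describes adapting a frozen language model to perception by training only a single linear projection and one prepended soft token. The following is a minimal, hypothetical PyTorch sketch of that freeze-and-project pattern; the module names, dimensions, and the way the frozen language model consumes embeddings are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PerceptualAdapter(nn.Module):
        """Freeze a pretrained LM; train only a projection and one soft token."""

        def __init__(self, language_model: nn.Module, vision_dim: int, embed_dim: int):
            super().__init__()
            self.lm = language_model
            for p in self.lm.parameters():
                p.requires_grad_(False)          # freeze >99% of total parameters
            self.proj = nn.Linear(vision_dim, embed_dim)                  # the only trained layer
            self.soft_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # one trainable token

        def forward(self, vision_features: torch.Tensor, text_embeddings: torch.Tensor):
            # vision_features: (batch, n_patches, vision_dim), from a frozen visual encoder
            # text_embeddings: (batch, seq_len, embed_dim), from the frozen LM's embedding table
            visual = self.proj(vision_features)                       # map perception into LM space
            prefix = self.soft_token.expand(visual.size(0), -1, -1)   # prepend the shared soft token
            inputs = torch.cat([prefix, visual, text_embeddings], dim=1)
            return self.lm(inputs)  # assumes the frozen LM accepts input embeddings directly

With this setup, an optimizer would be given only the projection weights and the soft token, keeping the trainable parameter count far below one percent of the full model.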