GShard: Scaling Giant Models with Conditional Computation and Automatic
Sharding
- URL: http://arxiv.org/abs/2006.16668v1
- Date: Tue, 30 Jun 2020 10:42:02 GMT
- Title: GShard: Scaling Giant Models with Conditional Computation and Automatic
Sharding
- Authors: Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan
Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
- Abstract summary: We show how to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding.
We demonstrate that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English.
- Score: 46.74457030177477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network scaling has been critical for improving the model quality in
many real-world machine learning applications with vast amounts of training
data and compute. Although this trend of scaling is affirmed to be a sure-fire
approach for better model quality, there are challenges on the path such as the
computation cost, ease of programming, and efficient implementation on parallel
devices. GShard is a module composed of a set of lightweight annotation APIs
and an extension to the XLA compiler. It provides an elegant way to express a
wide range of parallel computation patterns with minimal changes to the
existing model code. GShard enabled us to scale up a multilingual neural machine
translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600
billion parameters using automatic sharding. We demonstrate that such a giant
model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to
achieve far superior quality for translation from 100 languages to English
compared to the prior art.
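The conditional-computation half of this recipe is a Sparsely-Gated Mixture-of-Experts layer: a gate routes each token to its top-2 experts, so per-token compute stays roughly constant as experts (and parameters) are added, while the compiler partitions the expert dimension across devices. The snippet below is a minimal JAX sketch of top-2 gated MoE routing only, not GShard's actual annotation API; the function names, shapes, and the dense combine formulation are illustrative assumptions.

```python
# Minimal JAX sketch (illustrative assumptions, not the GShard API) of a
# top-2 gated Mixture-of-Experts layer: each token is routed to its two
# highest-scoring experts and their outputs are combined by the gate weights.
import jax
import jax.numpy as jnp

def moe_layer(params, x, top_k=2):
    """x: [tokens, d_model]; params holds per-expert FFN weights stacked on axis 0."""
    gate_logits = x @ params["w_gate"]                   # [tokens, num_experts]
    gate_probs = jax.nn.softmax(gate_logits, axis=-1)
    top_p, top_e = jax.lax.top_k(gate_probs, top_k)      # [tokens, top_k]

    # Dense-equivalent formulation for readability: run every expert, then keep
    # only each token's top-k outputs via a sparse combine matrix.
    def expert_fn(w_in, w_out):
        return jax.nn.relu(x @ w_in) @ w_out             # [tokens, d_model]

    expert_out = jax.vmap(expert_fn)(params["w_in"], params["w_out"])  # [E, tokens, d_model]
    combine = jnp.zeros(gate_probs.shape).at[
        jnp.arange(x.shape[0])[:, None], top_e].set(top_p)             # [tokens, E], nonzero only at top-k
    return jnp.einsum("te,etd->td", combine, expert_out)

key = jax.random.PRNGKey(0)
tokens, d_model, d_ff, num_experts = 16, 32, 64, 4
params = {
    "w_gate": jax.random.normal(key, (d_model, num_experts)) * 0.02,
    "w_in":  jax.random.normal(key, (num_experts, d_model, d_ff)) * 0.02,
    "w_out": jax.random.normal(key, (num_experts, d_ff, d_model)) * 0.02,
}
x = jax.random.normal(key, (tokens, d_model))
print(moe_layer(params, x).shape)  # (16, 32)
```

In a sharded setting the stacked expert weights would be partitioned along their leading (expert) axis and tokens dispatched so that each expert shard only processes the tokens routed to it; the dense-equivalent combine above trades that efficiency for readability.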
Related papers
- Low-resource neural machine translation with morphological modeling [3.3721926640077804]
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation.
We propose a framework-solution for modeling complex morphology in low-resource settings.
We evaluate our proposed solution on Kinyarwanda - English translation using public-domain parallel text.
arXiv Detail & Related papers (2024-04-03T01:31:41Z)
- DiPaCo: Distributed Path Composition [31.686642863608558]
We propose a co-designed modular architecture and training approach for machine learning models.
During training, DiPaCo distributes computation by paths through a set of shared modules.
At inference time, only a single path needs to be executed for each input, without the need for model compression.
arXiv Detail & Related papers (2024-03-15T18:26:51Z)
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation.
Our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas.
We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
arXiv Detail & Related papers (2020-04-28T07:13:50Z)
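As a rough illustration of the idea behind the last entry above (and not that paper's actual implementation), the sketch below reduce-scatters gradients across data-parallel replicas so each replica applies the optimizer update to only its 1/N slice of the weights, then all-gathers the updated slices back into full weights. The names, the plain SGD rule, and the flat weight vector are assumptions.

```python
# Minimal JAX sketch (assumptions throughout) of sharding the weight update
# across data-parallel replicas: gradients are reduce-scattered so each replica
# runs the optimizer on only its slice, then slices are all-gathered back.
import jax
import jax.numpy as jnp

n = jax.local_device_count()

def sharded_sgd_step(w, g, lr=0.1):
    # w: full replicated weights [dim]; g: this replica's local gradient [dim].
    g_shard = jax.lax.psum_scatter(g, "replica", tiled=True)   # summed gradient slice [dim // n]
    w_shard = w.reshape(n, -1)[jax.lax.axis_index("replica")]  # this replica's weight slice
    w_shard = w_shard - lr * (g_shard / n)                     # SGD with the mean gradient, on the slice only
    return jax.lax.all_gather(w_shard, "replica", tiled=True)  # reassemble full weights [dim]

step = jax.pmap(sharded_sgd_step, axis_name="replica")

dim = 8 * n
w = jnp.broadcast_to(jnp.ones(dim), (n, dim))                  # identical weights on every replica
g = jnp.stack([jnp.full(dim, float(i + 1)) for i in range(n)]) # one local gradient per replica
print(step(w, g)[0])                                           # updated full weights (same on all replicas)
```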