Switch Transformers: Scaling to Trillion Parameter Models with Simple
and Efficient Sparsity
- URL: http://arxiv.org/abs/2101.03961v1
- Date: Mon, 11 Jan 2021 16:11:52 GMT
- Title: Switch Transformers: Scaling to Trillion Parameter Models with Simple
and Efficient Sparsity
- Authors: William Fedus, Barret Zoph, Noam Shazeer
- Abstract summary: We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
- Score: 35.84448624327473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In deep learning, models typically reuse the same parameters for all inputs.
Mixture of Experts (MoE) defies this and instead selects different parameters
for each incoming example. The result is a sparsely-activated model -- with
outrageous numbers of parameters -- but a constant computational cost. However,
despite several notable successes of MoE, widespread adoption has been hindered
by complexity, communication costs and training instability -- we address these
with the Switch Transformer. We simplify the MoE routing algorithm and design
intuitive improved models with reduced communication and computational costs.
Our proposed training techniques help wrangle the instabilities and we show
large sparse models may be trained, for the first time, with lower precision
(bfloat16) formats. We design models based off T5-Base and T5-Large to obtain
up to 7x increases in pre-training speed with the same computational resources.
These improvements extend into multilingual settings where we measure gains
over the mT5-Base version across all 101 languages. Finally, we advance the
current scale of language models by pre-training up to trillion parameter
models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the
T5-XXL model.
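The routing simplification the abstract refers to is top-1 ("switch") routing: each token is dispatched to a single expert feed-forward network, so per-token compute stays constant while total parameters grow with the number of experts. Below is a minimal PyTorch sketch of that idea plus the load-balancing auxiliary loss described in the paper; the class name, sizes, and layer structure are illustrative assumptions, not the authors' released implementation, and details such as expert capacity, expert parallelism, and the bfloat16 stability tricks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFeedForward(nn.Module):
    """Top-1 ("switch") routing: each token is processed by exactly one expert FFN."""

    def __init__(self, d_model: int = 64, d_ff: int = 128, num_experts: int = 4):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # (num_tokens, num_experts)
        gate, expert_index = probs.max(dim=-1)         # pick the single best expert per token
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            sel = expert_index == e
            if sel.any():
                # Only the chosen expert runs on these tokens, so per-token compute
                # stays constant no matter how many experts (parameters) exist.
                out[sel] = gate[sel].unsqueeze(-1) * self.experts[e](x[sel])
        # Load-balancing auxiliary loss (scaled by a small coefficient during training):
        # num_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob for i).
        frac_tokens = torch.bincount(expert_index, minlength=self.num_experts).float() / x.size(0)
        aux_loss = self.num_experts * torch.sum(frac_tokens * probs.mean(dim=0))
        return out, aux_loss


if __name__ == "__main__":
    layer = SwitchFeedForward()
    tokens = torch.randn(32, 64)           # 32 tokens of width 64
    y, aux = layer(tokens)
    print(y.shape, float(aux))
```

The auxiliary loss is minimized when tokens and router probability mass are spread uniformly across experts, which is what keeps individual experts from being overloaded in practice.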
Related papers
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters [102.1116808722299]
We introduce TokenFormer, a scalable architecture for scaling Transformers.
By treating model parameters as tokens, we replace all the linear projections in Transformers with token-parameter attention layers.
Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs.
arXiv Detail & Related papers (2024-10-30T16:19:00Z)
- Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler [34.416299887009195]
We study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler.
We propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size.
Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models.
arXiv Detail & Related papers (2024-08-23T20:22:20Z)
- Model-Generated Pretraining Signals Improve Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activations and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose Aurora, a graceful prompt framework for cross-modal transfer, to tune such large multimodal foundation models parameter-efficiently.
Considering the redundancy in existing architectures, we first use mode approximation to generate 0.1M trainable parameters for multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
- Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, beyond model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster than the widely used T5-Base model.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
- Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
- Benchmarking down-scaled (not so large) pre-trained language models [0.0]
Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes.
We compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size.
In our experiments, MLM + NSP (BERT-style) consistently outperforms MLM alone (RoBERTa-style) as well as the standard LM objective.
arXiv Detail & Related papers (2021-05-11T09:01:04Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances and a late (and accurate) exit for hard instances (a rough sketch of the mechanism follows after this list).
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
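The early-exit idea in the last entry can be pictured with a small sketch: an output head is attached after each layer, and inference stops as soon as a prediction is confident enough. The PyTorch code below is an illustrative assumption about shapes and uses a hand-picked confidence threshold; it is not the paper's implementation, which additionally calibrates the per-layer confidence scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitClassifier(nn.Module):
    """Attach a small output head to every layer; stop once a prediction is confident."""

    def __init__(self, d_model: int = 64, num_layers: int = 4, num_classes: int = 3,
                 threshold: float = 0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        self.heads = nn.ModuleList([nn.Linear(d_model, num_classes) for _ in range(num_layers)])
        self.threshold = threshold  # hand-picked here; the paper calibrates confidences

    @torch.no_grad()
    def forward(self, x: torch.Tensor):
        # x: (1, seq_len, d_model) -- a single instance at inference time
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), start=1):
            x = layer(x)
            probs = F.softmax(head(x.mean(dim=1)), dim=-1)  # mean-pool tokens, classify
            confidence, label = probs.max(dim=-1)
            # Simple instances exit at a shallow layer; hard ones use the full stack.
            if confidence.item() >= self.threshold or depth == len(self.layers):
                return label.item(), depth


if __name__ == "__main__":
    model = EarlyExitClassifier()
    label, exit_depth = model(torch.randn(1, 16, 64))
    print(f"class {label}, exited after layer {exit_depth}")
```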