Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
- URL: http://arxiv.org/abs/2109.10686v1
- Date: Wed, 22 Sep 2021 12:29:15 GMT
- Title: Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
- Authors: Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar,
Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler
- Abstract summary: This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, aside from model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
- Score: 57.931830650323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There remain many open questions pertaining to the scaling
behaviour of Transformer architectures. These scaling decisions and findings
can be critical, as training runs often come with an associated computational
cost that has both financial and environmental impact. The goal of this paper
is to present scaling insights from pretraining and finetuning Transformers.
While Kaplan et al. present a comprehensive study of the scaling behaviour of
Transformer language models, its scope is limited to the upstream
(pretraining) loss. It therefore remains unclear whether these findings
transfer to downstream tasks within the pretrain-finetune paradigm. The key
findings of this paper are as follows: (1) we show that, aside from model
size alone, model shape matters for downstream fine-tuning; (2) scaling
protocols operate differently at different compute regions; (3) the widely
adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we
present improved scaling protocols whereby our redesigned models achieve
similar downstream fine-tuning quality while having 50% fewer parameters and
training 40% faster than the widely adopted T5-base model. We publicly
release over 100 pretrained checkpoints of different T5 configurations to
facilitate future research and analysis.
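To make the "model shape matters" point concrete, the sketch below counts attention and feed-forward parameters for a T5-Base-like shape and for a hypothetical deeper-but-narrower shape with a similar budget. The alternative shape and the simplified counting (embeddings, layer norms, and relative-position biases are ignored) are assumptions for illustration only; they are not the paper's released configurations.

```python
# Rough parameter count for a T5-style encoder-decoder, counting only
# attention and feed-forward weights (embeddings, layer norms and
# relative-position biases are ignored), so totals are approximate.

def transformer_params(num_layers: int, d_model: int, d_ff: int) -> int:
    attn = 4 * d_model * d_model             # Q, K, V, O projections
    ffn = 2 * d_model * d_ff                 # d_model -> d_ff -> d_model
    encoder = num_layers * (attn + ffn)      # self-attention + FFN
    decoder = num_layers * (2 * attn + ffn)  # self-attn + cross-attn + FFN
    return encoder + decoder

# T5-Base-like shape: 12 encoder / 12 decoder layers, d_model=768, d_ff=3072.
base = transformer_params(num_layers=12, d_model=768, d_ff=3072)

# Hypothetical deeper-but-narrower shape (illustrative numbers only).
deep_narrow = transformer_params(num_layers=24, d_model=512, d_ff=2048)

print(f"T5-Base-like:       {base / 1e6:.0f}M parameters")
print(f"deep-narrow sketch: {deep_narrow / 1e6:.0f}M parameters")
```

Two shapes with comparable totals allocate capacity very differently between depth and width; the paper's claim is that this allocation, and not the total alone, drives downstream fine-tuning quality.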
Related papers
- Unraveling the Mystery of Scaling Laws: Part I [39.967120253159614]
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training.
The original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas.
We provide step-by-step instructions to estimate all constant terms in the scaling-law formulas by training models with only 1M to 60M parameters (a minimal illustrative power-law fit appears after this list).
arXiv Detail & Related papers (2024-03-11T10:05:29Z)
- On Robustness of Finetuned Transformer-based NLP Models [11.063628128069736]
We characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR.
GPT-2 representations are more robust than BERT and T5 across multiple types of input perturbations.
This study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models.
arXiv Detail & Related papers (2023-05-23T18:25:18Z)
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activations and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of a smaller model that initializes the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z)
- Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
- Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation [111.44445634272235]
In this paper, we develop a parameter-efficient transfer learning architecture, termed PeterRec.
PeterRec allows the pre-trained parameters to remain unaltered during fine-tuning by injecting a series of re-learned neural networks.
We perform extensive experimental ablation to show the effectiveness of the learned user representation in five downstream tasks.
arXiv Detail & Related papers (2020-01-13T14:09:54Z)
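As referenced in the "Unraveling the Mystery of Scaling Laws" entry above, a power-law relation between loss and model size can be estimated from a handful of small-model runs. The sketch below is a minimal illustration under stated assumptions: the (model size, loss) points are synthetic, generated from an invented power law, and the SciPy fit simply recovers its constants; none of the numbers come from the cited papers.

```python
# Minimal sketch: fit a power law L(N) = a * N**(-alpha) + c to
# (model size, loss) points. All numbers here are synthetic and
# chosen only to illustrate the fitting procedure.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, alpha, c):
    # Loss as a function of parameter count n.
    return a * n ** (-alpha) + c

# Synthetic "observations" from small models (1M to 60M parameters),
# generated from an invented ground-truth law.
true_a, true_alpha, true_c = 30.0, 0.15, 2.0
model_sizes = np.array([1e6, 3e6, 10e6, 30e6, 60e6])
losses = scaling_law(model_sizes, true_a, true_alpha, true_c)

# Recover the constant terms from the small-model points alone.
(a, alpha, c), _ = curve_fit(scaling_law, model_sizes, losses,
                             p0=[10.0, 0.1, 1.0], maxfev=10000)
print(f"fitted a={a:.3g}, alpha={alpha:.3g}, c={c:.3g}")

# Extrapolate the fitted law to a much larger model (illustrative only).
print(f"predicted loss at 1B params: {scaling_law(1e9, a, alpha, c):.3f}")
```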
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.