On the Transformer Growth for Progressive BERT Training
- URL: http://arxiv.org/abs/2010.12562v3
- Date: Sun, 11 Jul 2021 06:42:23 GMT
- Title: On the Transformer Growth for Progressive BERT Training
- Authors: Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, Jiawei Han
- Abstract summary: We find that, similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models, respectively.
- Score: 37.57617077192438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the excessive cost of large-scale language model pre-training,
considerable efforts have been made to train BERT progressively: starting from
an inferior but low-cost model and gradually growing it to increase the
computational complexity. Our objective is to advance the understanding of
Transformer growth and discover principles that guide progressive training.
First, we find that similar to network architecture search, Transformer growth
also favors compound scaling. Specifically, while existing methods only conduct
network growth in a single dimension, we observe that it is beneficial to use
compound growth operators and balance multiple dimensions (e.g., depth, width,
and input length of the model). Moreover, we explore alternative growth
operators in each dimension via controlled comparison to give operator
selection practical guidance. In light of our analyses, the proposed method
speeds up BERT pre-training by 73.6% and 82.2% for the base and large models,
respectively, while achieving comparable performance.
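As a concrete illustration of compound growth, here is a minimal sketch that grows depth, width, and input length together across pre-training stages instead of along a single dimension. The stage lengths and target sizes are illustrative assumptions, not the schedule reported in the paper, and `grow_fn`/`train_fn` are placeholders for whatever growth operator and training loop one uses.
```python
from dataclasses import dataclass

@dataclass
class GrowthStage:
    steps: int     # training steps spent in this stage
    depth: int     # number of Transformer layers
    hidden: int    # hidden (width) dimension
    seq_len: int   # input sequence length

# Illustrative compound schedule (assumed numbers, not the paper's):
# all three dimensions grow together instead of, e.g., only stacking layers.
SCHEDULE = [
    GrowthStage(steps=50_000,  depth=3,  hidden=256, seq_len=128),
    GrowthStage(steps=100_000, depth=6,  hidden=512, seq_len=256),
    GrowthStage(steps=250_000, depth=12, hidden=768, seq_len=512),  # full BERT-base
]

def progressive_pretrain(schedule, grow_fn, train_fn, model=None):
    """Train stage by stage, growing the model between stages."""
    for stage in schedule:
        model = grow_fn(model, stage)       # expand depth / width / input length
        train_fn(model, steps=stage.steps)  # continue pre-training at this size
    return model
```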
Related papers
- Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture.
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
arXiv Detail & Related papers (2024-06-10T15:24:15Z)
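For intuition about the symmetric compatibility function described in the entry above, here is a minimal single-head sketch that obtains symmetry by sharing the query/key projection; this parameterization is an assumption for illustration and may differ from the paper's exact formulation.
```python
import torch
import torch.nn.functional as F
from torch import nn

class SymmetricSelfAttention(nn.Module):
    """Single-head self-attention with a symmetric score matrix.

    Queries and keys share one projection, so scores[i, j] == scores[j, i]
    before the softmax. One plausible parameterization, assumed here;
    the paper's compatibility function may be defined differently.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.qk = nn.Linear(dim, dim, bias=False)  # shared query/key projection
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        h = self.qk(x)
        scores = h @ h.transpose(-2, -1) / (h.shape[-1] ** 0.5)  # symmetric scores
        return F.softmax(scores, dim=-1) @ self.v(x)
```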
- A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
We show that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z)
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize larger ones.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
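A toy sketch of the idea stated in the entry above, i.e. expressing every target-model weight as a linear combination of all pretrained weights. The single dense `LinearWeightMap` below is an assumption for illustration; a full matrix of this kind would be intractable at BERT scale, which is why practical methods factorize the map.
```python
import torch
from torch import nn

def flatten_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of a model into one flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

class LinearWeightMap(nn.Module):
    """Toy version of a learnable linear map from all small-model weights to
    all large-model weights (real methods factorize this map)."""

    def __init__(self, n_small: int, n_large: int):
        super().__init__()
        self.map = nn.Linear(n_small, n_large, bias=False)

    @torch.no_grad()
    def init_large(self, small: nn.Module, large: nn.Module) -> None:
        """Initialize every parameter of `large` from the mapped weights of `small`."""
        theta_large = self.map(flatten_params(small))
        offset = 0
        for p in large.parameters():
            n = p.numel()
            p.copy_(theta_large[offset:offset + n].view_as(p))
            offset += n
```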
- GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training [78.63699436330165]
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks.
Online data are growing constantly, highlighting the importance of a pre-trained model's ability to learn from continuously expanding data.
We propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input.
arXiv Detail & Related papers (2023-08-22T10:07:49Z)
- Masked Structural Growth for 2x Faster Language Model Pre-training [18.276784451675603]
We focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one.
In terms of the growth schedule, existing work leaves the impact of each single dimension on a schedule's efficiency under-explored.
We propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators.
arXiv Detail & Related papers (2023-05-04T14:28:39Z)
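One generic way to make a growth operator strictly function-preserving, in the sense used in the entry above, is to add new structure that initially contributes nothing. The sketch below gates a newly inserted residual block with a zero-initialized scalar; MSG's own masking scheme may differ, so treat this as an assumed illustration.
```python
import copy
import torch
from torch import nn

class ZeroInitBlock(nn.Module):
    """Wrap a new block so that, at insertion time, it computes the identity:
    the residual branch is gated by a scalar initialized to zero, so adding the
    layer does not change the network's function (a generic trick assumed here;
    MSG's operators rely on masking new structure)."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.gate = nn.Parameter(torch.zeros(1))  # moves away from 0 during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.gate * self.block(x)

def grow_depth(layers: nn.ModuleList, template: nn.Module) -> nn.ModuleList:
    """Append a new, initially inert layer cloned from an existing block."""
    layers.append(ZeroInitBlock(copy.deepcopy(template)))
    return layers
```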
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of the smaller model to an initialization of the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
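A rough sketch of the staged setup described above: every stage after the first is warm-started from the previous stage's trained weights via a growth operator, so earlier compute is reused rather than discarded. The `build_first`, `grow`, `make_optimizer`, and `train_one_stage` callables are placeholders assumed for illustration.
```python
def staged_training(stage_configs, build_first, grow, make_optimizer, train_one_stage):
    """Sketch of staged training: each stage is initialized from the output of
    the previous one, so compute spent on the small model carries over."""
    model = None
    for cfg in stage_configs:
        if model is None:
            model = build_first(cfg)          # smallest model, trained from scratch
        else:
            model = grow(model, cfg)          # growth operator reuses trained weights
        optimizer = make_optimizer(model)     # (optimizer state can be grown as well)
        train_one_stage(model, optimizer, cfg)
    return model
```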
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing pretrained models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
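For intuition about reusing a smaller model's weights, the sketch below performs a Net2Net-style, function-preserving width expansion of a pair of linear layers: new output units copy existing ones, and the next layer's duplicated input columns are rescaled so the composed function is unchanged. bert2BERT's actual operators are more involved; this is only an assumed simplification.
```python
import torch
from torch import nn

def widen_linear(layer: nn.Linear, new_out: int):
    """Widen a layer's output dimension by duplicating existing units.
    Returns the wider layer plus the index map needed to fix the next layer."""
    old_out, in_f = layer.out_features, layer.in_features
    idx = torch.cat([torch.arange(old_out),
                     torch.randint(0, old_out, (new_out - old_out,))])
    wide = nn.Linear(in_f, new_out, bias=layer.bias is not None)
    with torch.no_grad():
        wide.weight.copy_(layer.weight[idx])
        if layer.bias is not None:
            wide.bias.copy_(layer.bias[idx])
    return wide, idx

def fix_next_linear(next_layer: nn.Linear, idx: torch.Tensor) -> nn.Linear:
    """Expand the next layer's input dimension and divide duplicated columns by
    their multiplicity, so the composition of the two layers is unchanged."""
    counts = torch.bincount(idx).float()
    fixed = nn.Linear(len(idx), next_layer.out_features,
                      bias=next_layer.bias is not None)
    with torch.no_grad():
        fixed.weight.copy_(next_layer.weight[:, idx] / counts[idx])
        if next_layer.bias is not None:
            fixed.bias.copy_(next_layer.bias)
    return fixed
```
Applied to every projection in a Transformer (plus embeddings and layer norms), this kind of expansion lets the large model start out computing the same function as the small one.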
This list is automatically generated from the titles and abstracts of the papers on this site.