Learning to Grow Pretrained Models for Efficient Transformer Training
- URL: http://arxiv.org/abs/2303.00980v1
- Date: Thu, 2 Mar 2023 05:21:18 GMT
- Title: Learning to Grow Pretrained Models for Efficient Transformer Training
- Authors: Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard,
Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim
- Abstract summary: We learn to grow pretrained transformers by learning to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
- Score: 72.20676008625641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling transformers has led to significant breakthroughs in many domains,
leading to a paradigm in which larger versions of existing models are trained
and released on a periodic basis. New instances of such models are typically
trained completely from scratch, despite the fact that they are often just
scaled-up versions of their smaller counterparts. How can we use the implicit
knowledge in the parameters of smaller, extant models to enable faster training
of newer, larger models? This paper describes an approach for accelerating
transformer training by learning to grow pretrained transformers, where we
learn to linearly map the parameters of the smaller model to initialize the
larger model. For tractable learning, we factorize the linear transformation as
a composition of (linear) width- and depth-growth operators, and further employ
a Kronecker factorization of these growth operators to encode architectural
knowledge. Extensive experiments across both language and vision transformers
demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50%
of the computational cost of training from scratch, while consistently
outperforming strong baselines that also reuse smaller pretrained models to
initialize larger models.
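To make the factorization concrete, the sketch below is a minimal, hypothetical illustration of the idea in the abstract rather than the authors' implementation: a single weight matrix of a smaller transformer is widened by a Kronecker-structured linear map (two small expansion matrices applied on either side), and depth is grown by linearly combining the widened layers. All shapes, variable names, and the random placeholder values stand in for quantities that LiGO would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained "small" transformer, reduced to its weight matrices (illustrative).
d_small, d_large = 4, 6          # hidden widths
L_small, L_large = 2, 3          # depths
small_layers = [rng.standard_normal((d_small, d_small)) for _ in range(L_small)]

# Width growth: A @ W @ B.T is a Kronecker-structured linear map on vec(W),
# since vec(A W B^T) = (B kron A) vec(W). A and B are learned in LiGO;
# random placeholders are used here.
A = rng.standard_normal((d_large, d_small))
B = rng.standard_normal((d_large, d_small))

def grow_width(W):
    return A @ W @ B.T                        # (d_large, d_large)

# Depth growth: each new layer is a linear combination of the widened small
# layers; uniform coefficients stand in for the learned ones.
depth_coef = np.full((L_large, L_small), 1.0 / L_small)
wide_layers = [grow_width(W) for W in small_layers]
large_layers = [sum(c * W for c, W in zip(depth_coef[i], wide_layers))
                for i in range(L_large)]

# These grown parameters would serve as the initialization of the larger model,
# which is then trained normally.
print([W.shape for W in large_layers])        # [(6, 6), (6, 6), (6, 6)]
```

Parameterizing the growth this way means the learnable mapping consists of small factors such as A, B, and the depth coefficients, rather than one dense matrix over all flattened parameters, which is what makes learning the operator tractable.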
Related papers
- Towards smaller, faster decoder-only transformers: Architectural variants and their implications [0.0]
We introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT, LinearGPT, and ConvGPT.
These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes.
arXiv Detail & Related papers (2024-04-22T06:19:46Z)
- Weight subcloning: direct initialization of transformers using larger pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z)
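For contrast with growing a smaller model, the snippet below sketches the kind of initialization weight subcloning describes: a smaller model's weight is taken from a sub-block of a larger pretrained one. The slicing here is the simplest possible choice and is only an assumption for illustration; the paper's actual selection of neurons and layers is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix from a larger pretrained transformer (illustrative shape).
d_large, d_small = 8, 4
W_large = rng.standard_normal((d_large, d_large))

# Subclone: initialize the scaled-down model's weight from a sub-block of the
# larger one (naively the leading rows/columns in this sketch).
W_small_init = W_large[:d_small, :d_small].copy()
print(W_small_init.shape)                     # (4, 4)
```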
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
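The sentence above describes, in its most general form, a dense linear map from every pretrained weight to every target weight. The sketch below shows that naive dense version purely for illustration (shapes and names are assumptions, and the paper's actual multi-linear structure is not reproduced); the size of such a mapping matrix, proportional to the product of the two models' parameter counts, is one reason factorized operators such as LiGO's are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# All pretrained (small-model) weights, flattened into a single vector.
theta_small = rng.standard_normal(32)         # illustrative parameter count

# A dense matrix correlates every target weight with every pretrained weight.
n_target = 72                                 # illustrative larger-model size
M = rng.standard_normal((n_target, theta_small.size))

theta_large_init = M @ theta_small            # each target weight mixes all sources
print(M.shape, theta_large_init.shape)        # (72, 32) (72,)
```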
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)