Learning to Grow Pretrained Models for Efficient Transformer Training
- URL: http://arxiv.org/abs/2303.00980v1
- Date: Thu, 2 Mar 2023 05:21:18 GMT
- Title: Learning to Grow Pretrained Models for Efficient Transformer Training
- Authors: Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard,
Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim
- Abstract summary: We learn to grow pretrained transformers by learning to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
- Score: 72.20676008625641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling transformers has led to significant breakthroughs in many domains,
leading to a paradigm in which larger versions of existing models are trained
and released on a periodic basis. New instances of such models are typically
trained completely from scratch, despite the fact that they are often just
scaled-up versions of their smaller counterparts. How can we use the implicit
knowledge in the parameters of smaller, extant models to enable faster training
of newer, larger models? This paper describes an approach for accelerating
transformer training by learning to grow pretrained transformers, where we
learn to linearly map the parameters of the smaller model to initialize the
larger model. For tractable learning, we factorize the linear transformation as
a composition of (linear) width- and depth-growth operators, and further employ
a Kronecker factorization of these growth operators to encode architectural
knowledge. Extensive experiments across both language and vision transformers
demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50%
of the computational cost of training from scratch, while consistently
outperforming strong baselines that also reuse smaller pretrained models to
initialize larger models.
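To make the factorization concrete, the sketch below is a minimal, hypothetical illustration of the idea in the abstract rather than the authors' implementation: a single weight matrix of a smaller transformer is widened by a Kronecker-structured linear map (two small expansion matrices applied on either side), and depth is grown by linearly combining the widened layers. All shapes, variable names, and the random placeholder values stand in for quantities that LiGO would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained "small" transformer, reduced to its weight matrices (illustrative).
d_small, d_large = 4, 6          # hidden widths
L_small, L_large = 2, 3          # depths
small_layers = [rng.standard_normal((d_small, d_small)) for _ in range(L_small)]

# Width growth: A @ W @ B.T is a Kronecker-structured linear map on vec(W),
# since vec(A W B^T) = (B kron A) vec(W). A and B are learned in LiGO;
# random placeholders are used here.
A = rng.standard_normal((d_large, d_small))
B = rng.standard_normal((d_large, d_small))

def grow_width(W):
    return A @ W @ B.T                        # (d_large, d_large)

# Depth growth: each new layer is a linear combination of the widened small
# layers; uniform coefficients stand in for the learned ones.
depth_coef = np.full((L_large, L_small), 1.0 / L_small)
wide_layers = [grow_width(W) for W in small_layers]
large_layers = [sum(c * W for c, W in zip(depth_coef[i], wide_layers))
                for i in range(L_large)]

# These grown parameters would serve as the initialization of the larger model,
# which is then trained normally.
print([W.shape for W in large_layers])        # [(6, 6), (6, 6), (6, 6)]
```

Parameterizing the growth this way means the learnable mapping consists of small factors such as A, B, and the depth coefficients, rather than one dense matrix over all flattened parameters, which is what makes learning the operator tractable.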
Related papers
- Towards smaller, faster decoder-only transformers: Architectural variants and their implications [0.0]
We introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT, LinearGPT, and ConvGPT.
These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes.
arXiv Detail & Related papers (2024-04-22T06:19:46Z)
- Weight subcloning: direct initialization of transformers using larger pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z)
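For contrast with growing a smaller model, the snippet below sketches the kind of initialization weight subcloning describes: a smaller model's weight is taken from a sub-block of a larger pretrained one. The slicing here is the simplest possible choice and is only an assumption for illustration; the paper's actual selection of neurons and layers is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix from a larger pretrained transformer (illustrative shape).
d_large, d_small = 8, 4
W_large = rng.standard_normal((d_large, d_large))

# Subclone: initialize the scaled-down model's weight from a sub-block of the
# larger one (naively the leading rows/columns in this sketch).
W_small_init = W_large[:d_small, :d_small].copy()
print(W_small_init.shape)                     # (4, 4)
```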
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
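The sentence above describes, in its most general form, a dense linear map from every pretrained weight to every target weight. The sketch below shows that naive dense version purely for illustration (shapes and names are assumptions, and the paper's actual multi-linear structure is not reproduced); the size of such a mapping matrix, proportional to the product of the two models' parameter counts, is one reason factorized operators such as LiGO's are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# All pretrained (small-model) weights, flattened into a single vector.
theta_small = rng.standard_normal(32)         # illustrative parameter count

# A dense matrix correlates every target weight with every pretrained weight.
n_target = 72                                 # illustrative larger-model size
M = rng.standard_normal((n_target, theta_small.size))

theta_large_init = M @ theta_small            # each target weight mixes all sources
print(M.shape, theta_large_init.shape)        # (72, 32) (72,)
```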
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of about half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)