Weight subcloning: direct initialization of transformers using larger
pretrained ones
- URL: http://arxiv.org/abs/2312.09299v1
- Date: Thu, 14 Dec 2023 19:08:56 GMT
- Title: Weight subcloning: direct initialization of transformers using larger
pretrained ones
- Authors: Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja
Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
- Abstract summary: We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
- Score: 42.056148990349094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large transformer models from scratch for a target task requires
lots of data and is computationally demanding. The usual practice of transfer
learning overcomes this challenge by initializing the model with weights of a
pretrained model of the same size and specification, which improves convergence
and training speed. However, what if no pretrained model of the required size
is available? In this paper, we introduce a simple yet effective technique to
transfer the knowledge of a pretrained model to smaller variants. Our approach
called weight subcloning expedites the training of scaled-down transformers by
initializing their weights from larger pretrained models.
Weight subcloning performs an operation on the pretrained model to obtain an
initialization for the scaled-down model. It consists of two key steps: first,
we introduce neuron importance ranking to decrease the embedding dimension per
layer in the pretrained model. Then, we remove blocks from the transformer
model to match the number of layers in the scaled-down network. The result is a
network that is ready for training and that trains significantly faster than its
randomly initialized counterpart. For instance, we achieve 4x
faster training for vision transformers in image classification and language
models designed for next token prediction.
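To make the two steps concrete, here is a minimal sketch of weight subcloning in PyTorch on a toy transformer block (feed-forward sub-layer only, attention omitted for brevity). The names ToyBlock, neuron_importance, and subclone, as well as the weight-magnitude importance score, are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of weight subcloning under the assumptions stated above.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.fc2(torch.relu(self.fc1(x))))


def neuron_importance(blocks) -> torch.Tensor:
    # Step 1 (ranking): score each embedding dimension, here by the norm of the
    # weights reading from / writing to it, accumulated over the kept blocks.
    return sum(b.fc1.weight.norm(dim=0) + b.fc2.weight.norm(dim=1) for b in blocks)


def subclone(blocks: nn.ModuleList, d_small: int, n_layers_small: int) -> nn.ModuleList:
    # Step 2 (depth): keep an evenly spaced subset of blocks to match the target depth.
    keep = torch.linspace(0, len(blocks) - 1, n_layers_small).round().long()
    kept = [blocks[int(i)] for i in keep]

    # Step 1 (width): keep the same top-ranked embedding dimensions in every block
    # so that residual connections stay index-aligned across layers.
    idx = neuron_importance(kept).topk(d_small).indices.sort().values

    small = nn.ModuleList()
    d_ff = kept[0].fc1.out_features  # the hidden dim could be shrunk the same way
    with torch.no_grad():
        for b in kept:
            nb = ToyBlock(d_small, d_ff)
            nb.fc1.weight.copy_(b.fc1.weight[:, idx])
            nb.fc1.bias.copy_(b.fc1.bias)
            nb.fc2.weight.copy_(b.fc2.weight[idx, :])
            nb.fc2.bias.copy_(b.fc2.bias[idx])
            nb.norm.weight.copy_(b.norm.weight[idx])
            nb.norm.bias.copy_(b.norm.bias[idx])
            small.append(nb)
    return small


# Example: a 12-layer, 768-dim "pretrained" stack subcloned into a 6-layer,
# 384-dim student that is then trained as usual.
parent = nn.ModuleList([ToyBlock(768, 3072) for _ in range(12)])
child = subclone(parent, d_small=384, n_layers_small=6)
```

In this sketch the same embedding indices are kept in every block, which keeps the residual stream consistent across layers; a full implementation would slice the embedding, unembedding, and attention projection matrices with the same index set.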
Related papers
- Efficient Training with Denoised Neural Weights [65.14892033932895]
This work takes a novel step towards building a weight generator to synthesize the neural weights for initialization.
We use the image-to-image translation task with generative adversarial networks (GANs) as an example due to the ease of collecting model weights.
By initializing the image translation model with the denoised weights predicted by our diffusion model, the training requires only 43.3 seconds.
arXiv Detail & Related papers (2024-07-16T17:59:42Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning a linear map from the parameters of the smaller model to an initialization of the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch (a toy sketch of this growth direction appears after this list).
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
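The growth-based approaches above go in the opposite direction of subcloning: a small pretrained model initializes a larger one. Below is a toy, single-layer sketch of the general idea behind a learned linear growth operator, assuming the larger layer's weight is parameterized as A · W_small · Bᵀ and only A and B are briefly trained; the layer sizes, dummy data, and warm-up objective are placeholders, and the actual LiGO operator is factorized across width and depth and trained on the real task.

```python
# Toy illustration of a learned linear growth operator (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in_s, d_out_s = 64, 64      # small (pretrained) layer shape
d_in_l, d_out_l = 128, 128    # target (larger) layer shape

small = nn.Linear(d_in_s, d_out_s)  # stands in for one pretrained small layer
# Learned growth factors: the grown weight is A @ W_small @ B.T
A = nn.Parameter(torch.randn(d_out_l, d_out_s) / d_out_s ** 0.5)
B = nn.Parameter(torch.randn(d_in_l, d_in_s) / d_in_s ** 0.5)
opt = torch.optim.Adam([A, B], lr=1e-2)

for _ in range(100):  # brief warm-up; real methods use the actual task data and loss
    x = torch.randn(32, d_in_l)   # placeholder inputs
    y = torch.randn(32, d_out_l)  # placeholder targets
    W_large = A @ small.weight.detach() @ B.T
    loss = F.mse_loss(x @ W_large.T, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Materialize the grown layer and continue normal training from this initialization.
large = nn.Linear(d_in_l, d_out_l, bias=False)
with torch.no_grad():
    large.weight.copy_(A @ small.weight @ B.T)
```

After the short warm-up, the grown weights are materialized and the larger model is trained as usual; the reported compute savings come from starting that training closer to the pretrained solution than a random initialization would.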