LEMON: Lossless model expansion
- URL: http://arxiv.org/abs/2310.07999v1
- Date: Thu, 12 Oct 2023 03:02:41 GMT
- Title: LEMON: Lossless model expansion
- Authors: Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan,
Haibin Lin, Ruoyu Sun, Hongxia Yang
- Abstract summary: Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance.
We present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts.
We show that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
- Score: 43.40389747029802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling of deep neural networks, especially Transformers, is pivotal for
their surging performance and has further led to the emergence of sophisticated
reasoning capabilities in foundation models. Such scaling generally requires
training large models from scratch with random initialization, failing to
leverage the knowledge acquired by their smaller counterparts, which are
already resource-intensive to obtain. To tackle this inefficiency, we present
$\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a
recipe to initialize scaled models using the weights of their smaller but
pre-trained counterparts. This is followed by model training with an optimized
learning rate scheduler tailored explicitly for the scaled models,
substantially reducing the training time compared to training from scratch.
Notably, LEMON is versatile, ensuring compatibility with various network
structures, including models like Vision Transformers and BERT. Our empirical
results demonstrate that LEMON reduces computational costs by 56.7% for Vision
Transformers and 33.2% for BERT when compared to training from scratch.
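The abstract does not spell out LEMON's expansion rules, but the core requirement of lossless expansion can be illustrated with a minimal, function-preserving width expansion in the Net2Net spirit: duplicate hidden units and split their outgoing weights so the widened network computes exactly the same function. The sketch below is an illustrative toy with made-up sizes and a hypothetical helper name, not LEMON's actual recipe, which additionally handles Transformer components such as LayerNorm and attention.
```python
# Minimal sketch of function-preserving ("lossless") width expansion for an MLP.
# Illustrative only; the helper name and sizes are assumptions, not LEMON's code.
import torch
import torch.nn as nn

def expand_mlp_width(small: nn.Sequential, factor: int = 2) -> nn.Sequential:
    """Widen the hidden layer of a 2-layer MLP while preserving its outputs."""
    fc1, act, fc2 = small[0], small[1], small[2]
    d_in, d_hidden = fc1.in_features, fc1.out_features
    d_out = fc2.out_features

    big_fc1 = nn.Linear(d_in, d_hidden * factor)
    big_fc2 = nn.Linear(d_hidden * factor, d_out)

    with torch.no_grad():
        # Duplicate each hidden neuron `factor` times (rows of W1, entries of b1).
        big_fc1.weight.copy_(fc1.weight.repeat(factor, 1))
        big_fc1.bias.copy_(fc1.bias.repeat(factor))
        # Split the outgoing weights by `factor` so the duplicated activations
        # sum back to the original contribution (columns of W2).
        big_fc2.weight.copy_(fc2.weight.repeat(1, factor) / factor)
        big_fc2.bias.copy_(fc2.bias)

    return nn.Sequential(big_fc1, act, big_fc2)

small = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
big = expand_mlp_width(small, factor=2)
x = torch.randn(4, 16)
assert torch.allclose(small(x), big(x), atol=1e-6)  # same function, more parameters
```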
Related papers
- FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into a product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as ``learngenes''.
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
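To make the factorization concrete, here is a hedged sketch of the decompose-and-reassemble step using a truncated SVD; the function names and the rank are illustrative assumptions, and FINE's key ingredient of sharing $U$ and $V$ across blocks during training is not reproduced here.
```python
# Hedged sketch of a FINE-style factorization step: decompose a pretrained weight
# as U @ diag(sigma) @ V^T and keep a truncated rank-r core for initialization.
# Names and the block-sharing scheme are assumptions, not the paper's code.
import torch

def factorize_weight(w: torch.Tensor, rank: int):
    """Truncated SVD of a weight matrix: returns (U, sigma, V)."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    return u[:, :rank], s[:rank], vh[:rank, :].T

def assemble_weight(u, sigma, v, out_shape):
    """Rebuild a weight from the factors; a variable-sized block would slice/pad here."""
    w = u @ torch.diag(sigma) @ v.T
    return w[: out_shape[0], : out_shape[1]]

w_pre = torch.randn(256, 256)            # a pretrained block's weight (toy)
u, sigma, v = factorize_weight(w_pre, rank=64)
w_init = assemble_weight(u, sigma, v, w_pre.shape)
print(torch.linalg.matrix_rank(w_init))  # rank <= 64: compact initialization
```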
arXiv Detail & Related papers (2024-09-28T08:57:17Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
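As a toy illustration of "each target weight depends on all pretrained weights", the sketch below flattens the small model's parameters and maps them through one dense learnable operator; the sizes are made up, and the paper's multi-linear (factorized) operators, which make this mapping tractable at scale, are not shown.
```python
# Toy illustration of mapping *all* pretrained weights to each target weight
# through one learned linear operator M: theta_large = M @ theta_small.
# A dense M is intractable in practice, hence the multi-linear factorization
# in the paper; the sizes below are arbitrary assumptions.
import torch
import torch.nn as nn

small_params = torch.randn(1_000)            # flattened pretrained weights (toy)
large_dim = 4_000                            # flattened target-model size (toy)

M = nn.Parameter(torch.zeros(large_dim, small_params.numel()))
nn.init.xavier_uniform_(M)                   # learnable dense operator

theta_large = M @ small_params               # every target weight depends on all small weights
print(theta_large.shape)                     # torch.Size([4000])
```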
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers by learning to linearly map the parameters of the smaller model to an initialization of the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
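A hedged sketch of what a factorized linear growth operator can look like: each small weight matrix is expanded by learned matrices applied on its input and output sides. The shapes are arbitrary and LiGO's depth-expansion component is omitted; this illustrates the idea rather than the paper's implementation.
```python
# Hedged sketch of a factorized linear growth operator in the spirit of LiGO:
# a small weight matrix W is expanded as W_big = A @ W @ B.T, where A and B are
# learned width-expansion matrices. Shapes are toy assumptions.
import torch
import torch.nn as nn

d_in_s, d_out_s = 128, 128      # small-model widths (toy)
d_in_l, d_out_l = 256, 256      # large-model widths (toy)

W_small = torch.randn(d_out_s, d_in_s)                     # a pretrained weight matrix
A = nn.Parameter(torch.randn(d_out_l, d_out_s) * 0.02)     # learned output-side expansion
B = nn.Parameter(torch.randn(d_in_l, d_in_s) * 0.02)       # learned input-side expansion

W_large = A @ W_small @ B.T                                # (256, 256) initialization
print(W_large.shape)
```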
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
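The mechanics of upcycling a dense checkpoint can be sketched as copying the pretrained feed-forward block into every expert of a Mixture-of-Experts layer and initializing the router from scratch; the class and layer sizes below are assumptions for illustration, not the paper's code.
```python
# Minimal sketch of sparse upcycling: initialize every expert of an MoE layer
# from the FFN of a dense checkpoint, with a freshly initialized router.
import copy
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dense_ffn: nn.Module, d_model: int, num_experts: int):
        super().__init__()
        # Each expert starts as an exact copy of the pretrained dense FFN,
        # so the upcycled model begins close to the dense model's function.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)  # trained from scratch

dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
moe = MoELayer(dense_ffn, d_model=512, num_experts=8)
```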
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - Alternate Model Growth and Pruning for Efficient Training of
Recommendation Systems [7.415129876303651]
Model pruning is an effective technique to reduce computation overhead for deep neural networks by removing redundant parameters.
Modern recommendation systems are still thirsty for model capacity due to the demand for handling big data.
We propose a dynamic training scheme, namely alternate model growth and pruning, to alternately construct and prune weights in the course of training.
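A rough sketch of an alternating grow-and-prune schedule inside a training loop is given below; the magnitude-pruning helper and the interval hyperparameters are placeholder assumptions, and the growth step is left as a stub (see the width-expansion sketch near the top of this page).
```python
# Rough sketch of alternating growth and pruning within one training run.
# Hyperparameters and the pruning criterion are illustrative assumptions.
import torch

def prune_by_magnitude(model, sparsity: float):
    """Zero out the smallest-magnitude fraction of weights in each linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k > 0:
                threshold = w.abs().flatten().kthvalue(k).values
                w[w.abs() < threshold] = 0.0

def train(model, loader, optimizer, grow_every=1000, prune_every=2000, steps=10_000):
    for step, (x, y) in zip(range(steps), loader):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step > 0 and step % grow_every == 0:
            pass  # growth step: e.g., widen layers (function-preserving expansion)
        if step > 0 and step % prune_every == 0:
            prune_by_magnitude(model, sparsity=0.2)
```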
arXiv Detail & Related papers (2021-05-04T03:14:30Z) - Performance of Transfer Learning Model vs. Traditional Neural Network in
Low System Resource Environment [0.0]
We compare the performance and cost of a lighter transfer learning model against a purpose-built neural network for the NLP tasks of text classification and NER.
The rise of state-of-the-art models such as BERT, XLNet, and GPT boosts accuracy and provides strong base models for transfer learning.
arXiv Detail & Related papers (2020-10-20T08:12:56Z) - Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.