Speeding up Deep Model Training by Sharing Weights and Then Unsharing
- URL: http://arxiv.org/abs/2110.03848v1
- Date: Fri, 8 Oct 2021 01:23:34 GMT
- Title: Speeding up Deep Model Training by Sharing Weights and Then Unsharing
- Authors: Shuo Yang, Le Hou, Xiaodan Song, Qiang Liu, Denny Zhou
- Abstract summary: We propose a simple and efficient approach for training the BERT model.
Our approach exploits the special structure of BERT that contains a stack of repeated modules.
- Score: 23.35912133295125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a simple and efficient approach for training the BERT model. Our
approach exploits the special structure of BERT that contains a stack of
repeated modules (i.e., transformer encoders). Our proposed approach first trains BERT with the weights shared across all repeated modules up to a certain point; this learns the component of the weights that is common to all repeated layers. We then stop weight sharing and continue training until convergence. We present theoretical insights into training by sharing weights and then unsharing, with an analysis of simplified models. Empirical experiments on the BERT model show that our method yields better-performing trained models and significantly reduces the number of training iterations.
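The following is a minimal PyTorch sketch of the share-then-unshare schedule, using toy MLP blocks in place of BERT's transformer encoders; the class and helper names (StackedModel, unshare) and all hyperparameters are illustrative assumptions, not the authors' code.
```python
# Sketch: train with one set of weights tied across a stack of repeated
# blocks, then untie and continue training. Toy blocks, not real BERT layers.
import copy
import torch
import torch.nn as nn

class StackedModel(nn.Module):
    def __init__(self, num_layers=6, dim=128, shared=True):
        super().__init__()
        block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        if shared:
            # Phase 1: every position in the stack points to the same module,
            # so gradients from all layers accumulate into one set of weights.
            self.layers = nn.ModuleList([block] * num_layers)
        else:
            self.layers = nn.ModuleList([copy.deepcopy(block) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def unshare(model):
    # Phase 2: replace the tied modules with independent copies initialized
    # from the shared weights, then keep training until convergence.
    model.layers = nn.ModuleList([copy.deepcopy(model.layers[0]) for _ in model.layers])
    return model

model = StackedModel(shared=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 128), torch.randn(32, 128)
for step in range(200):                      # shared-weight warm-up phase
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

model = unshare(model)                       # stop weight sharing
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # fresh optimizer for the new parameters
for step in range(200):                      # continue training until convergence
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```
Resetting the optimizer after unsharing and the length of the shared warm-up are choices made for this sketch; the abstract only says sharing continues "up to a certain point" before switching.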
Related papers
- BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion [56.9358325168226]
We propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND).
Our approach is simple but effective: it first uses multiple trained model weights and biases as inputs to train an autoencoder and a latent diffusion model.
Our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained model and the diffused model.
arXiv Detail & Related papers (2024-03-23T08:40:38Z)
- Fast Propagation is Better: Accelerating Single-Step Adversarial Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z)
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
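As a rough illustration of what "linearly correlating each target weight with all pretrained weights" can look like, here is a hedged PyTorch sketch at toy sizes; the dense operator M, the MSE objective, and all names are assumptions for illustration, and the paper's operators are more structured than this.
```python
# Sketch: synthesize every weight of a larger model as a learned linear
# combination of all weights of a small pretrained model.
import torch
import torch.nn as nn

small = nn.Linear(8, 8)                 # stands in for a pretrained small model
target_shape = (16, 16)                 # weight shape of the larger model

w_small = torch.cat([p.detach().flatten() for p in small.parameters()])
n_small = w_small.numel()
n_target = target_shape[0] * target_shape[1]

# Learnable linear operator M: each target weight = M[i] . w_small
M = nn.Parameter(torch.randn(n_target, n_small) / n_small ** 0.5)
opt = torch.optim.Adam([M], lr=1e-2)

x, y = torch.randn(64, 16), torch.randn(64, 16)
for step in range(100):
    W_big = (M @ w_small).view(target_shape)    # synthesize large weights
    loss = nn.functional.mse_loss(x @ W_big.T, y)
    opt.zero_grad(); loss.backward(); opt.step()

# The synthesized weights then initialize the large model, which is trained normally.
W_init = (M @ w_small).view(target_shape).detach()
```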
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
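For context, a generic top-1 mixture-of-experts feed-forward layer is sketched below in PyTorch; it shows only the MoE layout, not MoEBERT's importance-guided adaptation of a pretrained FFN, and all names are illustrative.
```python
# Sketch: a top-1 routed mixture-of-experts feed-forward layer.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=128, hidden=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)  # (tokens, num_experts)
        top = scores.argmax(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                # Scale by the routing probability so the router receives gradients.
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

layer = MoEFeedForward()
tokens = torch.randn(10, 128)
print(layer(tokens).shape)   # torch.Size([10, 128])
```
Only one expert runs per token, which is how such layers add capacity without a proportional increase in inference cost.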
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of about half their size.
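A heavily simplified sketch of the general idea of initializing a larger layer from a smaller pretrained one is shown below; bert2BERT's actual function-preserving expansion is more careful than the naive tiling used here, and the helper name expand_linear is an assumption.
```python
# Sketch: fill a larger linear layer with tiled, rescaled copies of a smaller
# pretrained layer's weights instead of training it from scratch.
import torch
import torch.nn as nn

def expand_linear(small: nn.Linear, out_features: int, in_features: int) -> nn.Linear:
    big = nn.Linear(in_features, out_features)
    w, b = small.weight.data, small.bias.data
    reps_out = -(-out_features // w.shape[0])   # ceil division
    reps_in = -(-in_features // w.shape[1])
    # Tile the small weight/bias to the larger shape, rescaling the input
    # dimension so activation magnitudes stay roughly comparable.
    big.weight.data = w.repeat(reps_out, reps_in)[:out_features, :in_features] / reps_in
    big.bias.data = b.repeat(reps_out)[:out_features]
    return big

small = nn.Linear(64, 64)            # stands in for a small pretrained layer
big = expand_linear(small, 128, 128) # initialization for the larger model
print(big.weight.shape)              # torch.Size([128, 128])
```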
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of both intra-image and inter-image information.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- CoRe: An Efficient Coarse-refined Training Framework for BERT [17.977099111813644]
We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model that has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
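A rough sketch of such a coarse-then-refined schedule is given below, assuming the "relaxation" is simply a shallower stack that is later duplicated to full depth; CoRe's actual relaxations and transformation are not reproduced here, and all names are illustrative.
```python
# Sketch: train a shallow "coarse" stack, then grow it to full depth by
# duplicating trained blocks and continue training the "refined" model.
import copy
import torch
import torch.nn as nn

def make_block(dim=128):
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

def train(model, steps=100):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(32, 128), torch.randn(32, 128)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Phase 1: coarse model with half the depth.
coarse = nn.Sequential(*[make_block() for _ in range(3)])
train(coarse)

# Phase 2: grow to full depth by duplicating each trained block, then retrain.
refined = nn.Sequential(*[copy.deepcopy(b) for b in coarse for _ in range(2)])
train(refined)
```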
arXiv Detail & Related papers (2020-11-27T09:49:37Z)
- Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z)
- A Practical Incremental Method to Train Deep CTR Models [37.54660958085938]
We introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules.
Our method achieves performance comparable to conventional batch-mode training with much better training efficiency.
arXiv Detail & Related papers (2020-09-04T12:35:42Z)
- Efficient Learning of Model Weights via Changing Features During Training [0.0]
We propose a machine learning model that dynamically changes its features during training.
Our main motivation is to update the model with small changes during training by replacing less descriptive features with new ones drawn from a large pool.
arXiv Detail & Related papers (2020-02-21T12:38:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.