Fast-ELECTRA for Efficient Pre-training
- URL: http://arxiv.org/abs/2310.07347v1
- Date: Wed, 11 Oct 2023 09:55:46 GMT
- Title: Fast-ELECTRA for Efficient Pre-training
- Authors: Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao,
Xiaodong Liu
- Abstract summary: ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model.
We propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model.
Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model.
- Score: 83.29484808667532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ELECTRA pre-trains language models by detecting tokens in a sequence that
have been replaced by an auxiliary model. Although ELECTRA offers a significant
boost in efficiency, its potential is constrained by the training cost brought
by the auxiliary model. Notably, this model, which is jointly trained with the
main model, only serves to assist the training of the main model and is
discarded post-training. This results in a substantial amount of training cost
being expended in vain. To mitigate this issue, we propose Fast-ELECTRA, which
leverages an existing language model as the auxiliary model. To construct a
learning curriculum for the main model, we smooth the auxiliary model's output distribution via
temperature scaling following a descending schedule. Our approach rivals the
performance of state-of-the-art ELECTRA-style pre-training methods, while
significantly eliminating the computation and memory cost brought by the joint
training of the auxiliary model. Our method also reduces the sensitivity to
hyper-parameters and enhances the pre-training stability.
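A minimal sketch of the core idea (not the authors' code): a frozen, pre-existing language model plays the auxiliary role, its logits are smoothed by a temperature that descends over training, and the sampled replacements define the replaced-token-detection labels for the main model. The linear decay schedule, toy vocabulary size, and random stand-in logits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB = 100  # toy vocabulary size (assumption)


def temperature(step: int, total_steps: int, t_max: float = 4.0, t_min: float = 1.0) -> float:
    """Descending temperature schedule (linear decay is an assumption;
    the abstract only states that the schedule is descending)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_max + frac * (t_min - t_max)


@torch.no_grad()
def make_rtd_batch(tokens, mask, aux_logits, temp):
    """Replace masked positions with tokens sampled from the frozen auxiliary
    model's temperature-smoothed distribution, and return the corrupted
    sequence plus replaced-token-detection labels for the main model."""
    probs = F.softmax(aux_logits / temp, dim=-1)            # smoothed aux distribution
    sampled = torch.multinomial(probs.view(-1, VOCAB), 1).view_as(tokens)
    corrupted = torch.where(mask, sampled, tokens)          # replace only masked slots
    labels = (corrupted != tokens).long()                   # 1 = replaced, 0 = original
    return corrupted, labels


# Toy usage: random logits stand in for an existing frozen language model.
tokens = torch.randint(0, VOCAB, (2, 8))                    # (batch, seq_len)
mask = torch.rand(2, 8) < 0.15                              # positions to corrupt
aux_logits = torch.randn(2, 8, VOCAB)                       # frozen aux model outputs
corrupted, labels = make_rtd_batch(tokens, mask, aux_logits,
                                   temp=temperature(step=100, total_steps=1000))
print(corrupted.shape, labels.float().mean().item())
```

Because the auxiliary model is frozen, only the temperature changes over training, which is what removes the joint-training compute and memory cost described in the abstract.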
Related papers
- Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models trained on diverse and unlabelled datasets.
We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
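As a generic sketch of masked spectrogram modeling (the paper's actual architecture is not given in the summary), the snippet below masks random time-frequency patches of a spectrogram and computes a reconstruction loss only on the masked patches; the patch size, masking ratio, and stand-in encoder are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy spectrogram batch: (batch, freq_bins, time_steps); sizes are assumptions.
spec = torch.randn(4, 64, 128)

patch = 16                                                        # square patch size (assumption)
patches = spec.unfold(1, patch, patch).unfold(2, patch, patch)    # (B, 4, 8, 16, 16)
patches = patches.contiguous().view(4, -1, patch * patch)         # (B, num_patches, dim)

mask_ratio = 0.75                                                 # MAE-style ratio (assumption)
mask = torch.rand(patches.shape[:2]) < mask_ratio                 # (B, num_patches)

# A stand-in encoder/decoder; the real model would be a transformer.
model = nn.Sequential(nn.Linear(patch * patch, 256), nn.GELU(),
                      nn.Linear(256, patch * patch))

inputs = patches.masked_fill(mask.unsqueeze(-1), 0.0)             # zero out masked patches
recon = model(inputs)

# Reconstruction loss is computed only on the masked patches.
loss = ((recon - patches) ** 2)[mask].mean()
loss.backward()
print(float(loss))
```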
arXiv Detail & Related papers (2024-11-14T23:56:57Z) - Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA [15.542668474378633]
We propose a novel and efficient machine unlearning method on pre-trained models.
We leverage LoRA to decompose the model's intermediate features into pre-trained features and residual features.
The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set.
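A minimal sketch of the residual-alignment idea as described above: a frozen pre-trained linear layer plus a LoRA residual branch, trained so the residual stays near zero on retained data and is pushed toward a shifted target on the data to be unlearned. The shift target, rank, and equal loss weighting are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class LoRALinear(nn.Module):
    """Frozen pre-trained weight plus a trainable low-rank residual B @ A."""

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))

    def residual(self, x):
        return x @ self.A.t() @ self.B.t()       # LoRA residual feature

    def forward(self, x):
        return self.base(x) + self.residual(x)   # pre-trained feature + residual


layer = LoRALinear(dim=32)
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)

retain = torch.randn(16, 32)    # data whose behaviour should be preserved
forget = torch.randn(16, 32)    # data to be unlearned
shift = 1.0                     # magnitude of the shifted-residual target (assumption)

for _ in range(10):
    opt.zero_grad()
    # Zero residuals on the retained set, shifted residuals on the unlearning set.
    loss_retain = layer.residual(retain).pow(2).mean()
    loss_forget = (layer.residual(forget) - shift).pow(2).mean()
    (loss_retain + loss_forget).backward()
    opt.step()
```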
arXiv Detail & Related papers (2024-11-13T08:56:35Z) - DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs.
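For intuition only, a small sketch of a skew KL divergence between teacher and student token distributions; the commonly cited form KL(p || alpha*p + (1-alpha)*q) is assumed here, and the exact loss used in DistiLLM may differ in details.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def skew_kl(teacher_logits, student_logits, alpha: float = 0.1):
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q), with p the teacher
    and q the student distribution (assumed form, for illustration)."""
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = alpha * p + (1.0 - alpha) * q
    return (p * (p.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()


# Toy token-level logits: (batch, seq_len, vocab).
teacher = torch.randn(2, 5, 100)
student = torch.randn(2, 5, 100, requires_grad=True)
loss = skew_kl(teacher, student)
loss.backward()
print(float(loss))
```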
arXiv Detail & Related papers (2024-02-06T11:10:35Z) - Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which prepares lessons for expanding operations by layer functionality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
arXiv Detail & Related papers (2024-01-17T13:04:14Z) - An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
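The LM up-scaling idea can be illustrated with a small log-probability combination: emulate a large fine-tuned model by adding the behavioural delta of a small fine-tuned model (relative to its small base) to a large base model. The combination rule below is an assumption based on the EFT description, and random logits stand in for real models.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB = 100

# Next-token logits from three models (random stand-ins for real LMs).
large_base = torch.randn(VOCAB)   # large pre-trained model
small_base = torch.randn(VOCAB)   # small pre-trained model
small_ft = torch.randn(VOCAB)     # small fine-tuned model

# LM up-scaling: combine large-scale pre-training with the fine-tuning
# "delta" learned at small scale, all in log-probability space.
emulated_logits = (F.log_softmax(large_base, dim=-1)
                   + F.log_softmax(small_ft, dim=-1)
                   - F.log_softmax(small_base, dim=-1))

probs = F.softmax(emulated_logits, dim=-1)   # renormalize
next_token = torch.multinomial(probs, 1)
print(int(next_token))
```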
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - Your Autoregressive Generative Model Can be Better If You Treat It as an
Energy-Based One [83.5162421521224]
We propose a unique method termed E-ARM for training autoregressive generative models.
E-ARM takes advantage of a well-designed energy-based learning objective.
We show that E-ARM can be trained efficiently and is capable of alleviating the exposure bias problem.
arXiv Detail & Related papers (2022-06-26T10:58:41Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Maximizing Efficiency of Language Model Pre-training for Learning
Representation [6.518508607788086]
ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models.
Our work proposes adaptive early exit strategy to maximize the efficiency of the pre-training process.
arXiv Detail & Related papers (2021-10-13T10:25:06Z) - SAS: Self-Augmented Strategy for Language Model Pre-training [31.69657711092598]
Most data augmentations in language model pre-training are context-independent.
We propose a self-augmented strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch.
Our SAS is able to outperform ELECTRA and other state-of-the-art models on the GLUE tasks with the same or less computation cost.
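A rough sketch of the self-augmentation loop described above: tokens sampled from this epoch's MLM predictions at masked positions are cached as the corrupted inputs for the next epoch's replaced-token training. The toy shapes, random stand-in logits, and sampling details are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB = 100


@torch.no_grad()
def self_augment(tokens, mask, mlm_logits):
    """Sample tokens at masked positions from the model's own MLM
    predictions; the result becomes next epoch's corrupted input."""
    probs = F.softmax(mlm_logits, dim=-1)
    sampled = torch.multinomial(probs.view(-1, VOCAB), 1).view_as(tokens)
    return torch.where(mask, sampled, tokens)


# Epoch t: the single forward pass yields MLM logits (random stand-ins here).
tokens = torch.randint(1, VOCAB, (2, 8))
mask = torch.rand(2, 8) < 0.15
mlm_logits = torch.randn(2, 8, VOCAB)

# Cache the augmented batch; at epoch t+1 it serves as the replaced-token
# input, with labels marking which positions were swapped.
augmented = self_augment(tokens, mask, mlm_logits)
rtd_labels = (augmented != tokens).long()
print(augmented.shape, rtd_labels.sum().item())
```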
arXiv Detail & Related papers (2021-06-14T05:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.