Fine-Tuning Pretrained Language Models: Weight Initializations, Data
Orders, and Early Stopping
- URL: http://arxiv.org/abs/2002.06305v1
- Date: Sat, 15 Feb 2020 02:40:10 GMT
- Title: Fine-Tuning Pretrained Language Models: Weight Initializations, Data
Orders, and Early Stopping
- Authors: Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh
Hajishirzi, Noah Smith
- Abstract summary: Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
- Score: 62.78338049381917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pretrained contextual word embedding models to supervised
downstream tasks has become commonplace in natural language processing. This
process, however, is often brittle: even with the same hyperparameter values,
distinct random seeds can lead to substantially different results. To better
understand this phenomenon, we experiment with four datasets from the GLUE
benchmark, fine-tuning BERT hundreds of times on each while varying only the
random seeds. We find substantial performance increases compared to previously
reported results, and we quantify how the performance of the best-found model
varies as a function of the number of fine-tuning trials. Further, we examine
two factors influenced by the choice of random seed: weight initialization and
training data order. We find that both contribute comparably to the variance of
out-of-sample performance, and that some weight initializations perform well
across all tasks explored. On small datasets, we observe that many fine-tuning
trials diverge part of the way through training, and we offer best practices
for practitioners to stop training less promising runs early. We publicly
release all of our experimental data, including training and validation scores
for 2,100 trials, to encourage further analysis of training dynamics during
fine-tuning.
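To make the two seed-dependent factors concrete, the sketch below shows one way to control weight initialization and training data order with separate seeds so each can be varied independently across fine-tuning trials. This is not the authors' released code; it assumes PyTorch and Hugging Face Transformers, and the model name, label count, and batch size are illustrative choices.
```python
# Minimal sketch (not the paper's released code) of seeding the two random
# factors studied here -- weight initialization and training data order --
# independently. Assumes PyTorch + Hugging Face Transformers; the model name,
# batch size, and dataset object are illustrative.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification


def build_model(init_seed: int):
    # The classification head added on top of BERT is randomly initialized,
    # so seeding PyTorch's RNG here fixes the weight initialization.
    torch.manual_seed(init_seed)
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )


def build_train_loader(train_dataset, order_seed: int, batch_size: int = 32):
    # A dedicated generator seeds only the shuffling, i.e. the order in which
    # training examples are visited, independently of the init seed.
    g = torch.Generator()
    g.manual_seed(order_seed)
    return DataLoader(train_dataset, batch_size=batch_size, shuffle=True,
                      generator=g)


# Varying one seed while holding the other fixed isolates that factor's
# contribution to the variance in out-of-sample performance, e.g.:
# model = build_model(init_seed=1)
# loader = build_train_loader(my_glue_train_set, order_seed=7)  # hypothetical dataset
```
Evaluating on validation data a few times per epoch then makes it straightforward to follow the paper's advice and stop less promising runs early, by abandoning trials whose partial-training validation scores lag behind the best trials seen so far.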
Related papers
- A Comparative Study of Pre-training and Self-training [0.40964539027092917]
We propose an ensemble method to empirically study all feasible training paradigms combining pre-training, self-training, and fine-tuning.
We conduct experiments on six datasets, four data augmentation methods, and imbalanced data for sentiment analysis and natural language inference tasks.
Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performance.
arXiv Detail & Related papers (2024-09-04T14:30:13Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates how pre-training affects the subsequent fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
- Dataset Pruning: Reducing Training Data by Examining Generalization Influence [30.30255670341501]
Do all training data contribute to the model's performance?
How can we construct the smallest subset of the entire training data to serve as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel strategy to select a subset of the pre-training data that helps improve generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets [13.132388683797503]
We propose a bias-aware loss function that estimates each dataset's biases during training with a linear function.
We prove the efficiency of the proposed method by training and validating quality prediction models on synthetic and subjective image and speech quality datasets.
arXiv Detail & Related papers (2021-04-20T19:20:11Z)
- Multi-Stage Influence Function [97.19210942277354]
We develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data.
We study two different scenarios with the pretrained embeddings fixed or updated in the finetuning tasks.
arXiv Detail & Related papers (2020-07-17T16:03:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.