On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
- URL: http://arxiv.org/abs/2006.04884v3
- Date: Thu, 25 Mar 2021 07:39:38 GMT
- Title: On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
- Authors: Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
- Abstract summary: Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.
Previous literature identified two potential reasons for the observed instability: catastrophic forgetting and the small size of the fine-tuning datasets.
We show that both hypotheses fail to explain the fine-tuning instability.
- Score: 31.807628937487927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pre-trained transformer-based language models such as BERT has
become a common practice dominating leaderboards across various NLP benchmarks.
Despite the strong empirical performance of fine-tuned models, fine-tuning is
an unstable process: training the same model with multiple random seeds can
result in a large variance of the task performance. Previous literature (Devlin
et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential
reasons for the observed instability: catastrophic forgetting and small size of
the fine-tuning datasets. In this paper, we show that both hypotheses fail to
explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT,
fine-tuned on commonly used datasets from the GLUE benchmark, and show that the
observed instability is caused by optimization difficulties that lead to
vanishing gradients. Additionally, we show that the remaining variance of the
downstream task performance can be attributed to differences in generalization
where fine-tuned models with the same training loss exhibit noticeably
different test performance. Based on our analysis, we present a simple but
strong baseline that makes fine-tuning BERT-based models significantly more
stable than the previously proposed approaches. Code to reproduce our results
is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
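The proposed baseline combines a small learning rate, Adam with bias correction, and substantially longer training with warmup. Below is a minimal sketch of this recipe using PyTorch and HuggingFace Transformers; the toy batch and step counts are placeholders, and the hyperparameters (lr 2e-5, 20 epochs, warmup over the first 10% of steps) follow the values commonly cited from the paper.

```python
# Minimal sketch of the stable fine-tuning recipe (assumed hyperparameters:
# lr 2e-5, 20 epochs, 10% linear warmup; toy data stands in for a GLUE task).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a small GLUE-style dataset.
batch = tok(["a toy example", "another toy example"], padding=True, return_tensors="pt")
batch["labels"] = torch.tensor([0, 1])

num_epochs, steps_per_epoch = 20, 10          # train longer than the usual 3 epochs
total_steps = num_epochs * steps_per_epoch

# torch.optim.AdamW performs bias correction by default, unlike the original
# BERTAdam; the paper traces early vanishing gradients to its absence.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-6, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

model.train()
for step in range(total_steps):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```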
Related papers
- COME: Test-time adaption by Conservatively Minimizing Entropy [45.689829178140634]
Conservatively Minimize the Entropy (COME) is a drop-in replacement for traditional entropy minimization (EM).
COME explicitly models uncertainty by characterizing a Dirichlet prior distribution over model predictions.
We show that COME achieves state-of-the-art performance on commonly used benchmarks.
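For context on what COME replaces: traditional entropy minimization (EM), as in Tent-style test-time adaptation, updates the model on unlabeled test batches by minimizing predictive entropy. A minimal sketch of that baseline objective follows; the model and optimizer are illustrative, and COME's Dirichlet-based conservative variant is specific to the paper and not reproduced here.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, x, optimizer):
    # Traditional test-time EM objective: minimize the Shannon entropy of
    # the predictive distribution on an unlabeled test batch.
    probs = F.softmax(model(x), dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

model = torch.nn.Linear(10, 3)                     # stand-in classifier
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
entropy_minimization_step(model, torch.randn(4, 10), opt)
```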
arXiv Detail & Related papers (2024-10-12T09:20:06Z)
- Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models [4.096453902709292]
BitFit and adapter modules are compared to standard full model fine-tuning.
The BitFit approach matches full fine-tuning performance across varying amounts of training data.
Adapter modules exhibit high variability, with inconsistent gains over default models.
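BitFit restricts fine-tuning to the bias terms of the pre-trained model. A minimal sketch follows; matching parameter names on "bias" is the usual heuristic, and keeping the newly initialized classifier head trainable is an assumption here.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# BitFit: train only bias terms (plus, by assumption, the new classifier head).
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```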
arXiv Detail & Related papers (2024-01-08T17:44:43Z)
- Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models [75.9543301303586]
Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data.
Fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks.
However, we argue that prior work has overlooked the inherent biases in foundation models.
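The underlying mechanism is classic logit adjustment: subtract a scaled log label prior from the logits so that label bias does not dominate predictions. A minimal sketch of that adjustment follows; the prior here is a hypothetical estimate of the foundation model's label bias, and the paper's procedure for estimating it is not reproduced.

```python
import torch

def adjust_logits(logits: torch.Tensor, label_prior: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Subtract the scaled log prior so over-represented classes are penalized.
    return logits - tau * torch.log(label_prior)

logits = torch.tensor([[2.0, 0.5, 0.1]])
prior = torch.tensor([0.7, 0.2, 0.1])   # hypothetical estimated label bias
print(adjust_logits(logits, prior).softmax(dim=-1))
```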
arXiv Detail & Related papers (2023-10-12T08:01:11Z)
- Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization [58.90989478049686]
Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
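A hedged sketch of the sub-net idea described above: dropout yields a different sub-net on each forward pass, so one can accumulate gradients across several passes and update only the entries with the largest accumulated magnitude. This illustrates the general mechanism only, not the paper's exact algorithm.

```python
import torch

def subnet_masked_step(model, loss_fn, batch, optimizer, passes=2, keep_ratio=0.5):
    # Sample several dropout sub-nets, accumulate their gradients, then
    # keep only the largest-magnitude gradient entries before stepping.
    model.train()                                 # dropout on: each pass is a sub-net
    optimizer.zero_grad()
    for _ in range(passes):
        (loss_fn(model, batch) / passes).backward()
    for p in model.parameters():
        if p.grad is None:
            continue
        k = max(1, int(keep_ratio * p.grad.numel()))
        thresh = p.grad.abs().flatten().topk(k).values[-1]
        p.grad.mul_((p.grad.abs() >= thresh).to(p.grad.dtype))
    optimizer.step()

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(0.1), torch.nn.Linear(8, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
subnet_masked_step(model, lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]), (x, y), opt)
```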
arXiv Detail & Related papers (2023-05-24T06:09:26Z)
- Towards Stable Test-Time Adaptation in Dynamic Wild World [60.98073673220025]
Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples.
However, online model updating in TTA can be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world.
arXiv Detail & Related papers (2023-02-24T02:03:41Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
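A hedged sketch of the first objective, parameter-efficient sparse updates: freeze the pre-trained weight and learn a sparse additive delta on top of it. The random mask below is a placeholder for the paper's learned/structured sparsity pattern.

```python
import torch

class SparseDeltaLinear(torch.nn.Module):
    # Freeze the pre-trained weight and learn only a sparse additive delta.
    def __init__(self, linear: torch.nn.Linear, density: float = 0.05):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                      # keep pre-trained weights fixed
        self.register_buffer("mask", (torch.rand_like(linear.weight) < density).float())
        self.delta = torch.nn.Parameter(torch.zeros_like(linear.weight))

    def forward(self, x):
        w = self.linear.weight + self.delta * self.mask  # sparse update only
        return torch.nn.functional.linear(x, w, self.linear.bias)

layer = SparseDeltaLinear(torch.nn.Linear(16, 16))
out = layer(torch.randn(2, 16))
```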
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- Noise Stability Regularization for Improving BERT Fine-tuning [94.80511419444723]
Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks.
We introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR)
We experimentally confirm that well-performing models show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly higher generalizability and stability.
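A minimal sketch of the layer-wise noise stability idea as summarized above: perturb a layer's input with Gaussian noise and penalize how much its output changes. The noise scale and the weighting of the penalty below are illustrative.

```python
import torch

def noise_stability_penalty(layer: torch.nn.Module, hidden: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    # Penalize the layer's sensitivity to small Gaussian input perturbations.
    clean = layer(hidden)
    noisy = layer(hidden + sigma * torch.randn_like(hidden))
    return (noisy - clean).pow(2).mean()

layer = torch.nn.Linear(8, 8)                      # stand-in for an encoder layer
penalty = noise_stability_penalty(layer, torch.randn(4, 8))
# total_loss = task_loss + reg_weight * penalty    # added to the fine-tuning loss
```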
arXiv Detail & Related papers (2021-07-10T13:19:04Z)
- On Robustness and Bias Analysis of BERT-based Relation Extraction [40.64969232497321]
We analyze a fine-tuned BERT model from different perspectives using relation extraction.
We find, through randomization, adversarial and counterfactual tests, and bias analyses, that BERT suffers from robustness bottlenecks.
arXiv Detail & Related papers (2020-09-14T05:24:28Z)
- Elastic weight consolidation for better bias inoculation [24.12790037712358]
Elastic weight consolidation (EWC) allows fine-tuning of models to mitigate biases.
EWC dominates standard fine-tuning, yielding models with lower levels of forgetting on the original (biased) dataset.
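EWC's standard penalty anchors parameters to their pre-fine-tuning values, weighted by diagonal Fisher information. A minimal sketch follows; the Fisher estimates here are placeholders, where in practice they come from squared gradients of the log-likelihood on the original dataset.

```python
import torch

def ewc_penalty(model, ref_params, fisher, lam=1.0):
    # Quadratic pull toward the original parameters, Fisher-weighted:
    # lam/2 * sum_i F_i * (theta_i - theta*_i)^2
    loss = torch.zeros(())
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - ref_params[n]).pow(2)).sum()
    return 0.5 * lam * loss

model = torch.nn.Linear(4, 2)
ref = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder
# total_loss = task_loss + ewc_penalty(model, ref, fisher, lam=10.0)
```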
arXiv Detail & Related papers (2020-04-29T17:45:12Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
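A minimal sketch of this protocol: fix everything except the random seed (which controls head initialization and data order) and record validation performance per trial. The fine_tune_and_evaluate helper below is a hypothetical stand-in stub for an actual fine-tuning run.

```python
import random
import numpy as np
import torch

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def fine_tune_and_evaluate(seed: int) -> float:
    # Hypothetical stand-in: a real run would fine-tune BERT on a GLUE
    # task with this seed and return the validation score.
    return float(torch.rand(1))

scores = []
for seed in range(20):                 # the paper runs hundreds of trials per task
    set_seed(seed)
    scores.append(fine_tune_and_evaluate(seed))
print(f"best={max(scores):.3f} mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```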
arXiv Detail & Related papers (2020-02-15T02:40:10Z)