On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
- URL: http://arxiv.org/abs/2006.04884v3
- Date: Thu, 25 Mar 2021 07:39:38 GMT
- Title: On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
- Authors: Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow
- Abstract summary: Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks.
Previous literature identified two potential reasons for the observed instability: catastrophic forgetting and the small size of the fine-tuning datasets.
We show that both hypotheses fail to explain the fine-tuning instability.
- Score: 31.807628937487927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pre-trained transformer-based language models such as BERT has
become a common practice dominating leaderboards across various NLP benchmarks.
Despite the strong empirical performance of fine-tuned models, fine-tuning is
an unstable process: training the same model with multiple random seeds can
result in a large variance of the task performance. Previous literature (Devlin
et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential
reasons for the observed instability: catastrophic forgetting and small size of
the fine-tuning datasets. In this paper, we show that both hypotheses fail to
explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT,
fine-tuned on commonly used datasets from the GLUE benchmark, and show that the
observed instability is caused by optimization difficulties that lead to
vanishing gradients. Additionally, we show that the remaining variance of the
downstream task performance can be attributed to differences in generalization
where fine-tuned models with the same training loss exhibit noticeably
different test performance. Based on our analysis, we present a simple but
strong baseline that makes fine-tuning BERT-based models significantly more
stable than the previously proposed approaches. Code to reproduce our results
is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
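The proposed baseline combines a small learning rate, Adam with bias correction, and substantially longer training with warmup. Below is a minimal sketch of this recipe using PyTorch and HuggingFace Transformers; the toy batch and step counts are placeholders, and the hyperparameters (lr 2e-5, 20 epochs, warmup over the first 10% of steps) follow the values commonly cited from the paper.

```python
# Minimal sketch of the stable fine-tuning recipe (assumed hyperparameters:
# lr 2e-5, 20 epochs, 10% linear warmup; toy data stands in for a GLUE task).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a small GLUE-style dataset.
batch = tok(["a toy example", "another toy example"], padding=True, return_tensors="pt")
batch["labels"] = torch.tensor([0, 1])

num_epochs, steps_per_epoch = 20, 10          # train longer than the usual 3 epochs
total_steps = num_epochs * steps_per_epoch

# torch.optim.AdamW performs bias correction by default, unlike the original
# BERTAdam; the paper traces early vanishing gradients to its absence.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-6, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

model.train()
for step in range(total_steps):
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```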
Related papers
- COME: Test-time adaption by Conservatively Minimizing Entropy [45.689829178140634]
Conservatively Minimize the Entropy (COME) is a drop-in replacement for traditional entropy minimization (EM).
COME explicitly models uncertainty by characterizing a Dirichlet prior distribution over model predictions.
We show that COME achieves state-of-the-art performance on commonly used benchmarks.
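For context on what COME replaces: traditional entropy minimization (EM), as in Tent-style test-time adaptation, updates the model on unlabeled test batches by minimizing predictive entropy. A minimal sketch of that baseline objective follows; the model and optimizer are illustrative, and COME's Dirichlet-based conservative variant is specific to the paper and not reproduced here.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_step(model, x, optimizer):
    # Traditional test-time EM objective: minimize the Shannon entropy of
    # the predictive distribution on an unlabeled test batch.
    probs = F.softmax(model(x), dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

model = torch.nn.Linear(10, 3)                     # stand-in classifier
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
entropy_minimization_step(model, torch.randn(4, 10), opt)
```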
arXiv Detail & Related papers (2024-10-12T09:20:06Z)
- Empirical Analysis of Efficient Fine-Tuning Methods for Large Pre-Trained Language Models [4.096453902709292]
BitFit and adapter modules are compared to standard full model fine-tuning.
The BitFit approach matches full fine-tuning performance across varying amounts of training data.
Adapter modules exhibit high variability, with inconsistent gains over default models.
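BitFit restricts fine-tuning to the bias terms of the pre-trained model. A minimal sketch follows; matching parameter names on "bias" is the usual heuristic, and keeping the newly initialized classifier head trainable is an assumption here.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# BitFit: train only bias terms (plus, by assumption, the new classifier head).
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```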
arXiv Detail & Related papers (2024-01-08T17:44:43Z)
- Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models [75.9543301303586]
Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data.
Fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks.
However, we argue that prior work has overlooked the inherent biases in foundation models.
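The underlying mechanism is classic logit adjustment: subtract a scaled log label prior from the logits so that label bias does not dominate predictions. A minimal sketch of that adjustment follows; the prior here is a hypothetical estimate of the foundation model's label bias, and the paper's procedure for estimating it is not reproduced.

```python
import torch

def adjust_logits(logits: torch.Tensor, label_prior: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Subtract the scaled log prior so over-represented classes are penalized.
    return logits - tau * torch.log(label_prior)

logits = torch.tensor([[2.0, 0.5, 0.1]])
prior = torch.tensor([0.7, 0.2, 0.1])   # hypothetical estimated label bias
print(adjust_logits(logits, prior).softmax(dim=-1))
```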
arXiv Detail & Related papers (2023-10-12T08:01:11Z)
- Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net Estimation and Optimization [58.90989478049686]
Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
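A hedged sketch of the sub-net idea described above: dropout yields a different sub-net on each forward pass, so one can accumulate gradients across several passes and update only the entries with the largest accumulated magnitude. This illustrates the general mechanism only, not the paper's exact algorithm.

```python
import torch

def subnet_masked_step(model, loss_fn, batch, optimizer, passes=2, keep_ratio=0.5):
    # Sample several dropout sub-nets, accumulate their gradients, then
    # keep only the largest-magnitude gradient entries before stepping.
    model.train()                                 # dropout on: each pass is a sub-net
    optimizer.zero_grad()
    for _ in range(passes):
        (loss_fn(model, batch) / passes).backward()
    for p in model.parameters():
        if p.grad is None:
            continue
        k = max(1, int(keep_ratio * p.grad.numel()))
        thresh = p.grad.abs().flatten().topk(k).values[-1]
        p.grad.mul_((p.grad.abs() >= thresh).to(p.grad.dtype))
    optimizer.step()

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(0.1), torch.nn.Linear(8, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
subnet_masked_step(model, lambda m, b: torch.nn.functional.cross_entropy(m(b[0]), b[1]), (x, y), opt)
```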
arXiv Detail & Related papers (2023-05-24T06:09:26Z)
- Towards Stable Test-Time Adaptation in Dynamic Wild World [60.98073673220025]
Test-time adaptation (TTA) has been shown to be effective at tackling distribution shifts between training and testing data by adapting a given model on test samples.
However, online model updating in TTA can be unstable, and this is often a key obstacle preventing existing TTA methods from being deployed in the real world.
arXiv Detail & Related papers (2023-02-24T02:03:41Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
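A hedged sketch of the first objective, parameter-efficient sparse updates: freeze the pre-trained weight and learn a sparse additive delta on top of it. The random mask below is a placeholder for the paper's learned/structured sparsity pattern.

```python
import torch

class SparseDeltaLinear(torch.nn.Module):
    # Freeze the pre-trained weight and learn only a sparse additive delta.
    def __init__(self, linear: torch.nn.Linear, density: float = 0.05):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                      # keep pre-trained weights fixed
        self.register_buffer("mask", (torch.rand_like(linear.weight) < density).float())
        self.delta = torch.nn.Parameter(torch.zeros_like(linear.weight))

    def forward(self, x):
        w = self.linear.weight + self.delta * self.mask  # sparse update only
        return torch.nn.functional.linear(x, w, self.linear.bias)

layer = SparseDeltaLinear(torch.nn.Linear(16, 16))
out = layer(torch.randn(2, 16))
```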
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- Noise Stability Regularization for Improving BERT Fine-tuning [94.80511419444723]
Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks.
We introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR)
We experimentally confirm that well-performing models show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly higher generalizability and stability.
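A minimal sketch of the layer-wise noise stability idea as summarized above: perturb a layer's input with Gaussian noise and penalize how much its output changes. The noise scale and the weighting of the penalty below are illustrative.

```python
import torch

def noise_stability_penalty(layer: torch.nn.Module, hidden: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    # Penalize the layer's sensitivity to small Gaussian input perturbations.
    clean = layer(hidden)
    noisy = layer(hidden + sigma * torch.randn_like(hidden))
    return (noisy - clean).pow(2).mean()

layer = torch.nn.Linear(8, 8)                      # stand-in for an encoder layer
penalty = noise_stability_penalty(layer, torch.randn(4, 8))
# total_loss = task_loss + reg_weight * penalty    # added to the fine-tuning loss
```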
arXiv Detail & Related papers (2021-07-10T13:19:04Z)
- On Robustness and Bias Analysis of BERT-based Relation Extraction [40.64969232497321]
We analyze a fine-tuned BERT model from different perspectives using relation extraction.
We find, through randomization, adversarial and counterfactual tests, and bias analyses, that BERT suffers from robustness bottlenecks.
arXiv Detail & Related papers (2020-09-14T05:24:28Z)
- Elastic weight consolidation for better bias inoculation [24.12790037712358]
Elastic weight consolidation (EWC) allows fine-tuning of models to mitigate biases.
EWC dominates standard fine-tuning, yielding models with lower levels of forgetting on the original (biased) dataset.
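EWC's standard penalty anchors parameters to their pre-fine-tuning values, weighted by diagonal Fisher information. A minimal sketch follows; the Fisher estimates here are placeholders, where in practice they come from squared gradients of the log-likelihood on the original dataset.

```python
import torch

def ewc_penalty(model, ref_params, fisher, lam=1.0):
    # Quadratic pull toward the original parameters, Fisher-weighted:
    # lam/2 * sum_i F_i * (theta_i - theta*_i)^2
    loss = torch.zeros(())
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - ref_params[n]).pow(2)).sum()
    return 0.5 * lam * loss

model = torch.nn.Linear(4, 2)
ref = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder
# total_loss = task_loss + ewc_penalty(model, ref, fisher, lam=10.0)
```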
arXiv Detail & Related papers (2020-04-29T17:45:12Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
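A minimal sketch of this protocol: fix everything except the random seed (which controls head initialization and data order) and record validation performance per trial. The fine_tune_and_evaluate helper below is a hypothetical stand-in stub for an actual fine-tuning run.

```python
import random
import numpy as np
import torch

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def fine_tune_and_evaluate(seed: int) -> float:
    # Hypothetical stand-in: a real run would fine-tune BERT on a GLUE
    # task with this seed and return the validation score.
    return float(torch.rand(1))

scores = []
for seed in range(20):                 # the paper runs hundreds of trials per task
    set_seed(seed)
    scores.append(fine_tune_and_evaluate(seed))
print(f"best={max(scores):.3f} mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```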
arXiv Detail & Related papers (2020-02-15T02:40:10Z)