On the Interplay Between Fine-tuning and Sentence-level Probing for
Linguistic Knowledge in Pre-trained Transformers
- URL: http://arxiv.org/abs/2010.02616v1
- Date: Tue, 6 Oct 2020 10:54:00 GMT
- Title: On the Interplay Between Fine-tuning and Sentence-level Probing for
Linguistic Knowledge in Pre-trained Transformers
- Authors: Marius Mosbach, Anna Khokhlova, Michael A. Hedderich, Dietrich Klakow
- Abstract summary: We study three different pre-trained models: BERT, RoBERTa, and ALBERT.
We find that for some probing tasks fine-tuning leads to substantial changes in accuracy.
While fine-tuning indeed changes the representations of a pre-trained model, only in very few cases does fine-tuning have a positive effect on probing accuracy.
- Score: 24.858283637038422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pre-trained contextualized embedding models has become an
integral part of the NLP pipeline. At the same time, probing has emerged as a
way to investigate the linguistic knowledge captured by pre-trained models.
Very little is, however, understood about how fine-tuning affects the
representations of pre-trained models and thereby the linguistic knowledge they
encode. This paper contributes towards closing this gap. We study three
different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate
through sentence-level probing how fine-tuning affects their representations.
We find that for some probing tasks fine-tuning leads to substantial changes in
accuracy, possibly suggesting that fine-tuning introduces linguistic knowledge
into a pre-trained model or even removes it. These changes, however, vary
greatly across different models, fine-tuning and probing tasks. Our analysis
reveals that while fine-tuning indeed changes the representations of a
pre-trained model and these changes are typically larger for higher layers,
only in very few cases does fine-tuning have a positive effect on probing
accuracy that is larger than simply using the pre-trained model with a strong
pooling method. Based on our findings, we argue that both positive and negative
effects of fine-tuning on probing require careful interpretation.
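The sentence-level probing setup described in the abstract can be illustrated with a short sketch. The following is a minimal illustration, not the authors' exact experimental setup: it pools token representations from a pre-trained encoder into a sentence vector (mean pooling standing in for a "strong pooling method") and trains a lightweight classifier on the frozen features. The model name, the toy probing data, and the logistic-regression probe are placeholder assumptions.

```python
# Minimal sketch of sentence-level probing (not the authors' exact setup):
# pool token representations from a frozen pre-trained encoder into one
# vector per sentence, then train a lightweight probe on top.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "bert-base-uncased"  # assumption: BERT/RoBERTa/ALBERT are probed the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

@torch.no_grad()
def sentence_embeddings(sentences, pooling="mean"):
    """Return one vector per sentence, using mean pooling over tokens
    (a strong pooling baseline) or the [CLS] token."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (batch, seq_len, dim)
    if pooling == "cls":
        return hidden[:, 0].numpy()
    mask = batch["attention_mask"].unsqueeze(-1)           # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical probing data: sentences labelled for some linguistic property
# (here, a toy grammatical vs. shuffled word-order distinction).
train_sents, train_labels = ["The cat sleeps .", "Sleeps cat the ."], [1, 0]
test_sents, test_labels = ["The dog barks .", "Barks dog the ."], [1, 0]

probe = LogisticRegression(max_iter=1000)
probe.fit(sentence_embeddings(train_sents), train_labels)
print("probing accuracy:", probe.score(sentence_embeddings(test_sents), test_labels))
```

Running the same probe on a fine-tuned checkpoint and comparing accuracies against the pre-trained model with a strong pooling method is the kind of before/after comparison the paper reports.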
Related papers
- An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - What Happens During Finetuning of Vision Transformers: An Invariance
Based Investigation [7.432224771219168]
The pretrain-finetune paradigm usually improves downstream performance over training a model from scratch on the same task.
In this work, we examine the relationship between pretrained vision transformers and the corresponding finetuned versions on several benchmark datasets and tasks.
arXiv Detail & Related papers (2023-07-12T08:35:24Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial
Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective.
RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process.
Experimental results show that RIFT consistently outperforms state-of-the-art methods on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z) - Exploring Strategies for Generalizable Commonsense Reasoning with
Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best by learning both the content and the structure of the task, but it suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z) - BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based
Masked Language-models [51.53936551681613]
We show that fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
These results support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge (a minimal bias-only fine-tuning sketch follows this list).
arXiv Detail & Related papers (2021-06-18T16:09:21Z) - What Happens To BERT Embeddings During Fine-tuning? [19.016185902256826]
We investigate how fine-tuning affects the representations of the BERT model.
We find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks.
In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing.
arXiv Detail & Related papers (2020-04-29T19:46:26Z) - Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models
via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
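As a companion to the BitFit summary above, here is a minimal sketch of bias-only fine-tuning, assuming the Hugging Face transformers and PyTorch APIs. It illustrates the idea of updating only the bias terms (plus the freshly initialised classification head) and is not the paper's exact implementation.

```python
# Minimal sketch of bias-only fine-tuning in the spirit of BitFit (an
# illustration, not the paper's exact implementation): freeze all weights
# of a pre-trained BERT classifier and update only the bias terms.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # assumption: a binary classification task
)

# Freeze everything except parameters whose name marks them as a bias.
# The new classification head is commonly kept trainable as well.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier.")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

print(
    f"trainable parameters: {sum(p.numel() for p in trainable):,} "
    f"of {sum(p.numel() for p in model.parameters()):,}"
)
# The usual training loop (forward pass, loss.backward(), optimizer.step())
# then updates only the bias terms and the classification head.
```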