What Happens To BERT Embeddings During Fine-tuning?
- URL: http://arxiv.org/abs/2004.14448v1
- Date: Wed, 29 Apr 2020 19:46:26 GMT
- Title: What Happens To BERT Embeddings During Fine-tuning?
- Authors: Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, Ian Tenney
- Abstract summary: We investigate how fine-tuning affects the representations of the BERT model.
We find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks.
In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing.
- Score: 19.016185902256826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there has been much recent work studying how linguistic information is
encoded in pre-trained sentence representations, comparatively little is
understood about how these models change when adapted to solve downstream
tasks. Using a suite of analysis techniques (probing classifiers,
Representational Similarity Analysis, and model ablations), we investigate how
fine-tuning affects the representations of the BERT model. We find that while
fine-tuning necessarily makes significant changes, it does not lead to
catastrophic forgetting of linguistic phenomena. We instead find that
fine-tuning primarily affects the top layers of BERT, but with noteworthy
variation across tasks. In particular, dependency parsing reconfigures most of
the model, whereas SQuAD and MNLI appear to involve much shallower processing.
Finally, we also find that fine-tuning has a weaker effect on representations
of out-of-domain sentences, suggesting room for improvement in model
generalization.
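Below is a minimal sketch of two of the analysis techniques named in the abstract, layer-wise probing classifiers and Representational Similarity Analysis (RSA), applied to BERT hidden states. It is an illustration rather than the authors' exact pipeline: the model names, toy sentences and labels, mean pooling, and the placeholder for a fine-tuned checkpoint are all assumptions.

```python
import numpy as np
import torch
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pretrained = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
# Placeholder: in a real analysis this would be a task-fine-tuned checkpoint
# (e.g. a BERT encoder tuned on MNLI or SQuAD), not the pre-trained weights again.
finetuned = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

# Toy probing data: a handful of sentences with made-up binary labels standing
# in for some linguistic property of interest.
sentences = [
    "The cat sat on the mat.", "Dogs bark loudly.",
    "She wrote a letter yesterday.", "He will travel tomorrow.",
    "They played chess all night.", "I am reading a book now.",
]
labels = np.array([1, 0, 1, 0, 1, 0])

def layer_reps(model, sents, layer):
    """Masked mean-pooled sentence vectors taken from one encoder layer."""
    batch = tokenizer(sents, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer]       # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Probing classifier: fit a linear probe on each layer and report its accuracy
# (a real probe would of course be evaluated on held-out data).
for layer in range(1, 13):
    feats = layer_reps(pretrained, sentences, layer)
    probe = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(feats, labels):.2f}")

def rdm(feats):
    """Representational dissimilarity matrix (1 - cosine similarity)."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

# RSA: correlate the pairwise-dissimilarity structure of the same sentences
# under the pre-trained and fine-tuned encoders at the top layer.
layer = 12
rdm_pre = rdm(layer_reps(pretrained, sentences, layer))
rdm_ft = rdm(layer_reps(finetuned, sentences, layer))
upper = np.triu_indices(len(sentences), k=1)               # unique sentence pairs
rho, _ = spearmanr(rdm_pre[upper], rdm_ft[upper])
print(f"RSA (Spearman rho) at layer {layer}: {rho:.3f}")
```

In this setup, per-layer probe accuracy shows where a property is linearly decodable, and a lower RSA correlation between the pre-trained and fine-tuned models indicates layers whose representational geometry fine-tuning has changed most (per the abstract, primarily the top layers).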
Related papers
- Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance [68.56701216210617]
In principle, one would expect models to adapt to the user context better after instruction finetuning.
We observe a surprising failure mode: during instruction tuning, the context reliance under knowledge conflicts initially increases as expected, but then gradually decreases.
arXiv Detail & Related papers (2024-10-14T17:57:09Z)
- Probing the Category of Verbal Aspect in Transformer Language Models [0.4757470449749875]
We investigate how pretrained language models encode the grammatical category of verbal aspect in Russian.
We perform probing using BERT and RoBERTa on alternative and non-alternative contexts.
Experiments show that BERT and RoBERTa do encode aspect--mostly in their final layers.
arXiv Detail & Related papers (2024-06-04T14:06:03Z)
- Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation [35.72916406365469]
We compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets.
Our results show that fine-tuned language models can in fact generalize well out-of-domain.
arXiv Detail & Related papers (2023-05-26T13:55:17Z)
- HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation [50.90457644954857]
We propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers.
We conduct extensive experiments and analyses on GLUE and other natural language inference datasets.
Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances generalization of hidden representations from different layers.
arXiv Detail & Related papers (2022-12-17T11:56:21Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- A Closer Look at How Fine-tuning Changes BERT [21.23284793831221]
We study the English BERT family and use two probing techniques to analyze how fine-tuning changes the representation space.
Our experiments reveal that fine-tuning improves performance because it pushes points associated with a label away from points associated with other labels.
By comparing the representations before and after fine-tuning, we also discover that fine-tuning does not change the representations arbitrarily; instead, it adjusts the representations to downstream tasks while preserving the original structure.
arXiv Detail & Related papers (2021-06-27T17:01:43Z)
- BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models [51.53936551681613]
We show that fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model; a minimal sketch of this bias-only setup appears after this list.
These findings support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
arXiv Detail & Related papers (2021-06-18T16:09:21Z)
- Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [0.0]
In RNNs, encoding information in a suboptimal way can impact the quality of representations based on later elements in the sequence.
I propose an augmentation to standard RNNs in the form of a gradient-based correction mechanism.
I conduct different experiments in the context of language modeling, where the impact of using such a mechanism is examined in detail.
arXiv Detail & Related papers (2021-01-03T17:54:17Z)
- On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers [24.858283637038422]
We study three different pre-trained models: BERT, RoBERTa, and ALBERT.
We find that for some probing tasks fine-tuning leads to substantial changes in accuracy.
While fine-tuning indeed changes the representations of a pre-trained model, only in very few cases does it have a positive effect on probing accuracy.
arXiv Detail & Related papers (2020-10-06T10:54:00Z)
- Deducing neighborhoods of classes from a fitted model [68.8204255655161]
In this article a new kind of interpretable machine learning method is presented.
It can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts.
Real data points (or specific points of interest) are used, and the changes in the prediction after slightly raising or lowering specific features are observed.
arXiv Detail & Related papers (2020-09-11T16:35:53Z)
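As referenced from the BitFit entry above, the following is a minimal sketch of bias-only fine-tuning, in which every parameter of a pre-trained BERT classifier is frozen except the bias terms and the freshly initialised task head. The model name, learning rate, and inclusion of the classifier head are illustrative assumptions, not the authors' exact recipe.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical two-way classification task; any sequence-level head would do.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

for name, param in model.named_parameters():
    # Keep only bias vectors and the new classification head trainable.
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier.")

n_total = sum(p.numel() for p in model.parameters())
n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"training {n_train:,} of {n_total:,} parameters ({100 * n_train / n_total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer; the learning rate
# here is an illustrative guess, not a recommendation from the paper.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

The usual fine-tuning loop (batching, loss.backward(), optimizer.step()) is unchanged; only which parameters receive gradient updates differs.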