A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading
Comprehension
- URL: http://arxiv.org/abs/2006.01346v1
- Date: Tue, 2 Jun 2020 02:12:19 GMT
- Title: A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading
Comprehension
- Authors: Jie Cai, Zhengzhou Zhu, Ping Nie and Qian Liu
- Abstract summary: We propose a pairwise probe to understand BERT fine-tuning on the machine reading comprehension (MRC) task.
According to pairwise probing tasks, we compare the performance of each layer's hidden representation of pre-trained and fine-tuned BERT.
Our experimental analysis leads to highly confident conclusions.
- Score: 9.446041739364135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models have brought significant improvements to many NLP tasks
and have been extensively analyzed. But little is known about the effect of
fine-tuning on specific tasks. Intuitively, people may agree that a pre-trained
model already learns semantic representations of words (e.g. synonyms are
closer to each other) and that fine-tuning further improves capabilities that
require more complicated reasoning (e.g. coreference resolution, entity
boundary detection, etc.). However, verifying these arguments analytically and
quantitatively is challenging, and few works focus on this topic. In this
paper, inspired by the observation that most probing tasks
involve identifying matched pairs of phrases (e.g. coreference requires
matching an entity and a pronoun), we propose a pairwise probe to understand
BERT fine-tuning on the machine reading comprehension (MRC) task. Specifically,
we identify five phenomena in MRC. According to pairwise probing tasks, we
compare the performance of each layer's hidden representation of pre-trained
and fine-tuned BERT. The proposed pairwise probe alleviates the distraction
caused by inaccurate model training and enables a robust, quantitative
comparison. Our experimental analysis leads to highly confident conclusions:
(1) Fine-tuning has little effect on the fundamental and low-level information
and general semantic tasks. (2) For specific abilities required for downstream
tasks, fine-tuned BERT is better than pre-trained BERT and such gaps are
obvious after the fifth layer.
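As a rough illustration of the idea (a minimal sketch, not the authors' exact implementation), the snippet below shows how such a layer-wise pairwise probe could be set up with the HuggingFace transformers library: extract hidden states from every layer of a pre-trained and a fine-tuned BERT, mean-pool pairs of token spans (e.g. a pronoun and a candidate antecedent), and fit a simple linear probe per layer. The pooling scheme, the logistic-regression probe, and the checkpoint names are illustrative assumptions.
```python
# Sketch of a layer-wise pairwise probe (assumptions: HuggingFace transformers,
# scikit-learn, mean-pooled span features, and a logistic-regression probe).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

def layerwise_hidden_states(model_name: str, text: str):
    """Encode one passage and return the hidden states of every layer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Tuple of (num_layers + 1) tensors, each of shape [1, seq_len, hidden_size].
    return outputs.hidden_states

def pair_feature(layer_states: torch.Tensor, span_a, span_b) -> np.ndarray:
    """Mean-pool two token spans and concatenate them into one probe input."""
    a = layer_states[0, span_a[0]:span_a[1]].mean(dim=0)
    b = layer_states[0, span_b[0]:span_b[1]].mean(dim=0)
    return torch.cat([a, b]).numpy()

def probe_layer(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a simple linear probe on matched vs. unmatched span pairs."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    # Toy in-sample score; a real probe would be evaluated on held-out pairs.
    return clf.score(features, labels)
```
Running this once for bert-base-uncased and once for an MRC fine-tuned checkpoint of the same architecture, then plotting probe accuracy against layer index for each of the five phenomena, gives the kind of layer-by-layer pre-trained vs. fine-tuned comparison the abstract describes.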
Related papers
- Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [52.723297744257536]
Pre-trained language models (LMs) have shown effectiveness in scientific literature understanding tasks.
We propose a multi-task contrastive learning framework, SciMult, to facilitate common knowledge sharing across different literature understanding tasks.
arXiv Detail & Related papers (2023-05-23T16:47:22Z)
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)
- Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs [37.754787051387034]
We propose a representation learning framework called breakpoint modeling.
Our approach trains models in an efficient and end-to-end fashion to build intermediate representations.
We show the benefit of our main breakpoint transformer, based on T5, over conventional representation learning approaches.
arXiv Detail & Related papers (2022-11-15T07:28:14Z)
- Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5 [50.574918785575655]
We compare sequential fine-tuning with a model for multi-task learning in the context of boosting performance on two tasks.
Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting.
arXiv Detail & Related papers (2022-10-31T13:26:08Z)
- Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence [45.9949173746044]
We show that large-size pre-trained language models (PLMs) do not satisfy the logical negation property (LNP).
We propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence.
We find that the task enables PLMs to learn lexical semantic information.
arXiv Detail & Related papers (2022-05-08T08:37:36Z)
- The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders [17.74208462902158]
Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks.
We first present MTL results on five NLP tasks: POS, NER, DEP, CON, and SRL.
We then conduct an extensive pruning analysis to show that a certain set of attention heads gets claimed by most tasks during MTL.
arXiv Detail & Related papers (2021-09-14T19:32:11Z)
- Weighted Training for Cross-Task Learning [71.94908559469475]
We introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning.
We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees.
As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning.
arXiv Detail & Related papers (2021-05-28T20:27:02Z)
- Embedding Adaptation is Still Needed for Few-Shot Learning [25.4156194645678]
ATG is a principled clustering method for defining train and test tasksets without additional human knowledge.
We empirically demonstrate the effectiveness of ATG in generating tasksets that are easier, in-between, or harder than existing benchmarks.
We leverage our generated tasksets to shed a new light on few-shot classification: gradient-based methods can outperform metric-based ones when transfer is most challenging.
arXiv Detail & Related papers (2021-04-15T06:00:04Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA in the pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.