A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading
Comprehension
- URL: http://arxiv.org/abs/2006.01346v1
- Date: Tue, 2 Jun 2020 02:12:19 GMT
- Title: A Pairwise Probe for Understanding BERT Fine-Tuning on Machine Reading
Comprehension
- Authors: Jie Cai, Zhengzhou Zhu, Ping Nie and Qian Liu
- Abstract summary: We propose a pairwise probe to understand BERT fine-tuning on the machine reading comprehension (MRC) task.
According to pairwise probing tasks, we compare the performance of each layer's hidden representation of pre-trained and fine-tuned BERT.
Our experimental analysis leads to highly confident conclusions.
- Score: 9.446041739364135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models have brought significant improvements to many NLP tasks
and have been extensively analyzed. But little is known about the effect of
fine-tuning on specific tasks. Intuitively, people may agree that a pre-trained
model already learns semantic representations of words (e.g. synonyms are
closer to each other) and that fine-tuning further improves capabilities that
require more complicated reasoning (e.g. coreference resolution, entity
boundary detection, etc.). However, verifying these arguments analytically and
quantitatively is challenging, and few works focus on this topic. In this
paper, inspired by the observation that most probing tasks
involve identifying matched pairs of phrases (e.g. coreference requires
matching an entity and a pronoun), we propose a pairwise probe to understand
BERT fine-tuning on the machine reading comprehension (MRC) task. Specifically,
we identify five phenomena in MRC. According to pairwise probing tasks, we
compare the performance of each layer's hidden representation of pre-trained
and fine-tuned BERT. The proposed pairwise probe alleviates the distraction
caused by inaccurate model training and enables a robust, quantitative
comparison. Our experimental analysis leads to highly confident conclusions:
(1) Fine-tuning has little effect on the fundamental and low-level information
and general semantic tasks. (2) For specific abilities required for downstream
tasks, fine-tuned BERT is better than pre-trained BERT and such gaps are
obvious after the fifth layer.
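As a rough illustration of the idea (a minimal sketch, not the authors' exact implementation), the snippet below shows how such a layer-wise pairwise probe could be set up with the HuggingFace transformers library: extract hidden states from every layer of a pre-trained and a fine-tuned BERT, mean-pool pairs of token spans (e.g. a pronoun and a candidate antecedent), and fit a simple linear probe per layer. The pooling scheme, the logistic-regression probe, and the checkpoint names are illustrative assumptions.
```python
# Sketch of a layer-wise pairwise probe (assumptions: HuggingFace transformers,
# scikit-learn, mean-pooled span features, and a logistic-regression probe).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

def layerwise_hidden_states(model_name: str, text: str):
    """Encode one passage and return the hidden states of every layer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Tuple of (num_layers + 1) tensors, each of shape [1, seq_len, hidden_size].
    return outputs.hidden_states

def pair_feature(layer_states: torch.Tensor, span_a, span_b) -> np.ndarray:
    """Mean-pool two token spans and concatenate them into one probe input."""
    a = layer_states[0, span_a[0]:span_a[1]].mean(dim=0)
    b = layer_states[0, span_b[0]:span_b[1]].mean(dim=0)
    return torch.cat([a, b]).numpy()

def probe_layer(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a simple linear probe on matched vs. unmatched span pairs."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    # Toy in-sample score; a real probe would be evaluated on held-out pairs.
    return clf.score(features, labels)
```
Running this once for bert-base-uncased and once for an MRC fine-tuned checkpoint of the same architecture, then plotting probe accuracy against layer index for each of the five phenomena, gives the kind of layer-by-layer pre-trained vs. fine-tuned comparison the abstract describes.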
Related papers
- Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [52.723297744257536]
Pre-trained language models (LMs) have shown effectiveness in scientific literature understanding tasks.
We propose a multi-task contrastive learning framework, SciMult, to facilitate common knowledge sharing across different literature understanding tasks.
arXiv Detail & Related papers (2023-05-23T16:47:22Z)
- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)
- Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs [37.754787051387034]
We propose a representation learning framework called breakpoint modeling.
Our approach trains models in an efficient and end-to-end fashion to build intermediate representations.
We show the benefit of our main breakpoint transformer, based on T5, over conventional representation learning approaches.
arXiv Detail & Related papers (2022-11-15T07:28:14Z)
- Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5 [50.574918785575655]
We compare sequential fine-tuning with a model for multi-task learning in the context of boosting performance on two tasks.
Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting.
arXiv Detail & Related papers (2022-10-31T13:26:08Z)
- Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence [45.9949173746044]
We show that large-size pre-trained language models (PLMs) do not satisfy the logical negation property (LNP).
We propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence.
We find that the task enables PLMs to learn lexical semantic information.
arXiv Detail & Related papers (2022-05-08T08:37:36Z)
- The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders [17.74208462902158]
Multi-task learning with transformer encoders (MTL) has emerged as a powerful technique to improve performance on closely-related tasks.
We first present MTL results on five NLP tasks: POS, NER, DEP, CON, and SRL.
We then conduct an extensive pruning analysis to show that a certain set of attention heads gets claimed by most tasks during MTL.
arXiv Detail & Related papers (2021-09-14T19:32:11Z)
- Weighted Training for Cross-Task Learning [71.94908559469475]
We introduce Target-Aware Weighted Training (TAWT), a weighted training algorithm for cross-task learning.
We show that TAWT is easy to implement, is computationally efficient, requires little hyperparameter tuning, and enjoys non-asymptotic learning-theoretic guarantees.
As a byproduct, the proposed representation-based task distance allows one to reason in a theoretically principled way about several critical aspects of cross-task learning.
arXiv Detail & Related papers (2021-05-28T20:27:02Z)
- Embedding Adaptation is Still Needed for Few-Shot Learning [25.4156194645678]
ATG is a principled clustering method for defining train and test tasksets without additional human knowledge.
We empirically demonstrate the effectiveness of ATG in generating tasksets that are easier, in-between, or harder than existing benchmarks.
We leverage our generated tasksets to shed a new light on few-shot classification: gradient-based methods can outperform metric-based ones when transfer is most challenging.
arXiv Detail & Related papers (2021-04-15T06:00:04Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA in the pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.