Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation
- URL: http://arxiv.org/abs/2002.10345v1
- Date: Mon, 24 Feb 2020 16:17:12 GMT
- Title: Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation
- Authors: Yige Xu, Xipeng Qiu, Ligao Zhou, Xuanjing Huang
- Abstract summary: Fine-tuning pre-trained language models like BERT has become an effective approach in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
- Score: 84.64004917951547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning pre-trained language models like BERT has become an effective approach
in NLP and yields state-of-the-art results on many downstream tasks. Recent
studies on adapting BERT to new tasks mainly focus on modifying the model
structure, re-designing the pre-training tasks, and leveraging external data and
knowledge. The fine-tuning strategy itself has yet to be fully explored. In
this paper, we improve the fine-tuning of BERT with two effective mechanisms:
self-ensemble and self-distillation. The experiments on text classification and
natural language inference tasks show our proposed methods can significantly
improve the adaptation of BERT without any external data or knowledge.
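A minimal sketch of how these two mechanisms could fit into a fine-tuning loop, assuming the self-ensemble teacher is a parameter average of the K most recent student checkpoints and the self-distillation term is an MSE between student and teacher logits. K, lambda_kd, and the stand-in StudentClassifier are illustrative assumptions, not the paper's exact components or settings:

```python
# Sketch only: self-ensemble = average of the last K student checkpoints,
# self-distillation = MSE between student and teacher logits.  K, lambda_kd,
# and StudentClassifier are illustrative stand-ins, not the paper's settings.
import copy
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentClassifier(nn.Module):
    """Placeholder for a BERT-based classifier (a BERT encoder plus task head)."""
    def __init__(self, hidden=32, num_labels=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, hidden), nn.ReLU(), nn.Linear(hidden, num_labels))

    def forward(self, x):
        return self.net(x)


def averaged_teacher(checkpoints, template):
    """Self-ensemble: load the parameter average of recent checkpoints into a copy."""
    teacher = copy.deepcopy(template)
    avg = {k: sum(c[k] for c in checkpoints) / len(checkpoints) for k in checkpoints[0]}
    teacher.load_state_dict(avg)
    teacher.eval()
    return teacher


K, lambda_kd = 3, 1.0
student = StudentClassifier()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
recent = deque(maxlen=K)                           # rolling window of checkpoints

for step in range(20):                             # toy loop with random data
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    logits = student(x)
    loss = F.cross_entropy(logits, y)              # ordinary task loss

    if recent:                                     # self-distillation once a teacher exists
        teacher = averaged_teacher(list(recent), student)
        with torch.no_grad():
            teacher_logits = teacher(x)
        loss = loss + lambda_kd * F.mse_loss(logits, teacher_logits)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    recent.append(copy.deepcopy(student.state_dict()))
```

In a real setup the stand-in classifier would be the fine-tuned BERT model and the distillation weight would be tuned per task.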
Related papers
- LegalTurk Optimized BERT for Multi-Label Text Classification and NER [0.0]
We introduce a modified pre-training approach that combines diverse masking strategies.
In this work, we focus on two essential downstream tasks in the legal domain: named entity recognition and multi-label text classification.
Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model.
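The summary above mentions combining diverse masking strategies; a hedged sketch of what such a mix could look like when building masked-language-model examples (the 50/50 strategy mix, the 15% rate, and the helper names are illustrative assumptions, not the paper's recipe):

```python
# Illustrative only: a 50/50 mix of token-level and whole-word masking at a
# 15% rate; the actual combination used in the paper is not specified here.
import random

MASK = "[MASK]"

def token_masking(pieces, rate=0.15):
    """Mask individual WordPiece tokens independently."""
    return [MASK if random.random() < rate else p for p in pieces]

def whole_word_masking(pieces, rate=0.15):
    """Mask every piece of a word together ('##' marks continuation pieces)."""
    out, mask_word = [], False
    for p in pieces:
        if not p.startswith("##"):                 # a new word starts here
            mask_word = random.random() < rate
        out.append(MASK if mask_word else p)
    return out

def make_mlm_example(pieces):
    """Pick one masking strategy per training example."""
    return random.choice([token_masking, whole_word_masking])(pieces)

pieces = ["the", "defend", "##ant", "breach", "##ed", "the", "contract"]
print(make_mlm_example(pieces))
```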
arXiv Detail & Related papers (2024-06-30T10:19:54Z) - Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have the potential to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
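A hedged sketch of what "extremely sparse replay" means operationally during sequential fine-tuning: while training on the current task, a stored batch from earlier tasks is mixed in only once every N steps (the interval, buffer policy, and stand-in model are illustrative assumptions):

```python
# Illustrative only: the replay interval, buffer policy, and stand-in model
# are assumptions used to show the mechanics of sparse replay.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 2)                 # stand-in for a BERT-based classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
replay_buffer = []                       # a handful of (x, y) batches from past tasks
REPLAY_EVERY = 100                       # "extremely sparse": one replay per 100 steps

def train_task(num_steps):
    for step in range(num_steps):
        x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))   # current-task batch
        loss = F.cross_entropy(model(x), y)
        if replay_buffer and step % REPLAY_EVERY == 0:          # sparse replay step
            rx, ry = random.choice(replay_buffer)
            loss = loss + F.cross_entropy(model(rx), ry)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    replay_buffer.append((x.detach(), y))        # keep one batch from the finished task

for task in range(3):                            # tasks arrive one after another
    train_task(num_steps=300)
```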
arXiv Detail & Related papers (2023-03-02T09:03:43Z) - BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate the performance bottlenecks of binarization.
Our method yields an impressive 56.3x saving in FLOPs and a 31.2x saving in model size.
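For context on what "fully binarized" implies, here is a hedged sketch of generic 1-bit weight quantization with a straight-through estimator; it illustrates the basic scheme, not BiBERT's specific Bi-Attention or distillation components:

```python
# Illustrative only: this is the generic binarization scheme (sign of the
# weights times their mean absolute value, with a straight-through gradient),
# not BiBERT's specific components.
import torch

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, clipped-identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # pass gradients only where |w| <= 1

def binarize_weight(w):
    """1-bit weight: alpha * sign(w), with alpha = mean(|w|)."""
    return w.abs().mean() * BinarizeSTE.apply(w)

w = torch.randn(4, 4, requires_grad=True)
w_bin = binarize_weight(w)                         # entries are +/- alpha (sign times scale)
w_bin.sum().backward()                             # gradients flow through the STE
print(w_bin)
print(w.grad)
```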
arXiv Detail & Related papers (2022-03-12T09:46:13Z) - PromptBERT: Improving BERT Sentence Embeddings with Prompts [95.45347849834765]
We propose a prompt-based sentence embedding method that reduces token embedding biases and makes the original BERT layers more effective.
We also propose a novel unsupervised training objective based on template denoising, which substantially narrows the performance gap between the supervised and unsupervised settings.
Our fine-tuned method outperforms the state-of-the-art method SimCSE in both unsupervised and supervised settings.
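A hedged sketch of the prompt-based embedding idea: the sentence is wrapped in a cloze template and the hidden state at the [MASK] position is used as its embedding instead of the [CLS] vector or mean pooling (the exact template wording here is an assumption for illustration; requires the transformers package):

```python
# Illustrative only: the template wording is an assumption; requires the
# `transformers` package and downloads bert-base-uncased on first use.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def prompt_embedding(sentence):
    """Return the hidden state at the [MASK] slot of a cloze template."""
    template = f'This sentence : "{sentence}" means {tokenizer.mask_token} .'
    inputs = tokenizer(template, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state            # (1, seq_len, 768)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    return hidden[0, mask_pos]                                 # (768,) sentence embedding

emb = prompt_embedding("BERT fine-tuning can be improved.")
print(emb.shape)                                               # torch.Size([768])
```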
arXiv Detail & Related papers (2022-01-12T06:54:21Z) - Fine-Tuning Large Neural Language Models for Biomedical Natural Language
Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pre-training settings, especially in low-resource domains.
We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
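The summary does not name the techniques studied, so as one concrete example of a commonly used fine-tuning stabilization technique (an illustration chosen here, not taken from the paper), this sketch builds layer-wise learning-rate decay so that lower BERT layers receive smaller learning rates than the task head:

```python
# Illustrative only: layer-wise learning-rate decay is a commonly used
# stabilization technique chosen here for illustration; base_lr and decay are
# placeholders.  Requires the `transformers` package.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def layerwise_lr_groups(model, base_lr=2e-5, decay=0.9, num_layers=12):
    """Give deeper (later) layers and the task head larger learning rates."""
    groups = []
    for name, param in model.named_parameters():
        depth = num_layers                        # default: treat as top (pooler / classifier)
        if "embeddings" in name:
            depth = 0
        elif "encoder.layer." in name:
            depth = int(name.split("encoder.layer.")[1].split(".")[0]) + 1
        groups.append({"params": [param], "lr": base_lr * decay ** (num_layers - depth)})
    return groups

optimizer = torch.optim.AdamW(layerwise_lr_groups(model))
```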
arXiv Detail & Related papers (2021-12-15T04:20:35Z) - Using Prior Knowledge to Guide BERT's Attention in Semantic Textual
Matching Tasks [13.922700041632302]
We study the problem of incorporating prior knowledge into a deep Transformer-based model, i.e., Bidirectional Encoder Representations from Transformers (BERT).
We obtain a better understanding of which task-specific knowledge BERT needs most and where it is most needed.
Experiments demonstrate that the proposed knowledge-enhanced BERT is able to consistently improve semantic textual matching performance.
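A hedged sketch of one common way prior knowledge can be injected into attention (an assumed mechanism for illustration, not necessarily the paper's exact formulation): a prior token-similarity matrix is added as a bias to the attention logits before the softmax:

```python
# Illustrative only: the additive-bias formulation and the toy identity prior
# are assumptions used to show the mechanics.
import math
import torch
import torch.nn.functional as F

def attention_with_prior(q, k, v, prior, weight=1.0):
    """Scaled dot-product attention with an additive prior bias.

    q, k, v: (seq_len, d) tensors; prior: (seq_len, seq_len) similarity scores.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # standard attention logits
    scores = scores + weight * prior                           # prior knowledge as a bias
    return F.softmax(scores, dim=-1) @ v

seq_len, d = 5, 8
q = k = v = torch.randn(seq_len, d)
prior = torch.eye(seq_len)               # toy prior: tokens prefer attending to themselves
print(attention_with_prior(q, k, v, prior).shape)              # torch.Size([5, 8])
```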
arXiv Detail & Related papers (2021-02-22T12:07:16Z) - Augmenting BERT Carefully with Underrepresented Linguistic Features [6.096779295981379]
Fine-tuned Bidirectional Encoder Representations from Transformers (BERT)-based sequence classification models have proven to be effective for detecting Alzheimer's Disease (AD) from transcripts of human speech.
Previous research shows it is possible to improve BERT's performance on various tasks by augmenting the model with additional information.
We show that jointly fine-tuning BERT in combination with these features improves the performance of AD classification by up to 5% over fine-tuned BERT alone.
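A hedged sketch of a simple fusion scheme for the joint fine-tuning described above (the concrete wiring, feature count, and class name are assumptions for illustration): a hand-crafted feature vector is concatenated with BERT's pooled [CLS] representation before classification:

```python
# Illustrative only: the fusion layout, feature count, and class name are
# assumptions; the encoder below is a stand-in for a real BERT model.
import torch
import torch.nn as nn

class BertWithFeatures(nn.Module):
    def __init__(self, bert_dim=768, num_features=12, num_labels=2):
        super().__init__()
        self.encoder = nn.Linear(bert_dim, bert_dim)   # stand-in for a BERT encoder
        self.classifier = nn.Linear(bert_dim + num_features, num_labels)

    def forward(self, pooled_cls, features):
        fused = torch.cat([self.encoder(pooled_cls), features], dim=-1)
        return self.classifier(fused)

model = BertWithFeatures()
pooled_cls = torch.randn(4, 768)         # pooled [CLS] vectors for a batch of 4
features = torch.randn(4, 12)            # e.g. lexical or acoustic feature vectors
print(model(pooled_cls, features).shape)                       # torch.Size([4, 2])
```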
arXiv Detail & Related papers (2020-11-12T01:32:41Z) - DagoBERT: Generating Derivational Morphology with a Pretrained Language
Model [20.81930455526026]
We show that pretrained language models (PLMs) can generate derivationally complex words.
Our best model, DagoBERT, clearly outperforms the previous state of the art in derivation generation.
Our experiments show that the input segmentation crucially impacts BERT's derivational knowledge.
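To make "input segmentation" concrete, a small sketch that inspects how a WordPiece tokenizer splits derivationally complex words, which may or may not align with their true morphological structure (requires the transformers package; the outputs depend on the vocabulary):

```python
# Illustrative only: requires the `transformers` package; the resulting
# subword splits depend on the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["unhappiness", "renationalization", "swimmer"]:
    print(word, "->", tokenizer.tokenize(word))    # WordPiece segmentation of each word
```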
arXiv Detail & Related papers (2020-05-02T01:26:46Z) - Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less
Forgetting [66.45372974713189]
We propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks.
Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark.
We provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into Adam to facilitate adoption by the NLP community.
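A simplified, hedged sketch of the recall-and-learn idea that RecAdam folds into Adam: a quadratic penalty on drift away from the pretrained weights ("recall") is added to the downstream loss ("learn"). The real RecAdam's annealing schedule and optimizer-level integration are omitted, and the stand-in model and gamma are illustrative:

```python
# Illustrative only: the penalty strength and stand-in model are placeholders,
# and RecAdam's annealing schedule is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 2)                                # stand-in for a pretrained BERT
pretrained = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.01                                            # strength of the "recall" penalty

for step in range(100):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    task_loss = F.cross_entropy(model(x), y)                          # "learn"
    recall_loss = sum(((p - pretrained[name]) ** 2).sum()
                      for name, p in model.named_parameters())        # "recall"
    loss = task_loss + gamma * recall_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```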
arXiv Detail & Related papers (2020-04-27T08:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.