Self-training For Pre-training Language Models
- URL: http://arxiv.org/abs/2011.09031v3
- Date: Wed, 19 May 2021 00:44:11 GMT
- Title: Self-training For Pre-training Language Models
- Authors: Tong Guo
- Abstract summary: In industry NLP applications, we have large amounts of data produced by users or customers.
Our learning framework is based on these large amounts of unlabeled data.
- Score: 0.5139874302398955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model pre-training has proven to be useful in many language
understanding tasks. In this paper, we investigate whether it is still helpful
to add a self-training method to the pre-training step and the fine-tuning
step. Towards this goal, we propose a learning framework that makes the best use
of unlabeled data in both low-resource and high-resource labeled settings. In
industry NLP applications, we have large amounts of data produced by users or
customers, and our learning framework is built on these large amounts of
unlabeled data. First, we use the model fine-tuned on the manually labeled
dataset to predict pseudo labels for the user-generated unlabeled data. Then we
use the pseudo labels to supervise task-specific training on the large amounts
of user-generated data. We treat this task-specific training on pseudo labels
as a pre-training step for the subsequent fine-tuning. Finally, we fine-tune the
resulting pre-trained model on the manually labeled dataset. In this work, we
first show empirically that our method solidly improves performance by 3.6%
when the manually labeled fine-tuning dataset is relatively small. We then show
that our method still improves performance by a further 0.2% when the manually
labeled fine-tuning dataset is relatively large. We argue that our method makes
the best use of the unlabeled data and is superior to either pre-training or
self-training alone.
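As a rough illustration of the three-stage loop described in the abstract (fine-tune a teacher on manual labels, pseudo-label the user-generated data and treat training on it as task-specific pre-training, then fine-tune again on the manual labels), here is a minimal sketch using the Hugging Face Transformers Trainer API. The base model name, placeholder texts, label count, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the three-stage self-training framework described above.
# Assumptions: a BERT-style sequence classifier, toy placeholder data,
# and default Trainer settings; not the paper's exact setup.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"   # assumption: any encoder classifier works
NUM_LABELS = 2                     # assumption: binary classification task

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class TextDataset(Dataset):
    """Wraps raw texts with (manual or pseudo) labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def fine_tune(model, dataset, output_dir):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=32)
    trainer = Trainer(model=model, args=args, train_dataset=dataset)
    trainer.train()
    return trainer

labeled_texts, labeled_ids = ["a manually labeled example ..."], [1]  # placeholders
unlabeled_texts = ["a user-generated, unlabeled example ..."]         # placeholders

# Stage 1: fine-tune a teacher on the manually labeled dataset.
teacher = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)
teacher_trainer = fine_tune(teacher, TextDataset(labeled_texts, labeled_ids),
                            "teacher")

# Stage 2: predict pseudo labels for the large user-generated data, then use
# them to supervise task-specific training, treated here as a pre-training step.
logits = teacher_trainer.predict(
    TextDataset(unlabeled_texts, [0] * len(unlabeled_texts))).predictions
pseudo_labels = logits.argmax(axis=-1).tolist()

student = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)
fine_tune(student, TextDataset(unlabeled_texts, pseudo_labels), "pseudo_pretrain")

# Stage 3: fine-tune the pseudo-label pre-trained model on the manual labels.
final_trainer = fine_tune(student, TextDataset(labeled_texts, labeled_ids), "final")
```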
Related papers
- FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness
for Semi-Supervised Learning [73.13448439554497]
Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data.
Most SSL methods are based on instance-wise consistency between different data transformations.
We propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets.
arXiv Detail & Related papers (2023-10-25T06:57:59Z)
- Doubly Robust Self-Training [46.168395767948965]
We introduce doubly robust self-training, a novel semi-supervised algorithm.
We demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
arXiv Detail & Related papers (2023-06-01T00:57:16Z)
- Online pseudo labeling for polyp segmentation with momentum networks [5.920947681019466]
In semi-supervised learning, the quality of labels plays a crucial role in model performance.
We present a new pseudo labeling strategy that enhances the quality of pseudo labels used for training student networks.
Our results surpass common practice by 3% and even approach fully-supervised results on some datasets.
arXiv Detail & Related papers (2022-09-29T07:33:54Z)
- SLADE: A Self-Training Framework For Distance Metric Learning [75.54078592084217]
We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data.
We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data.
We then train a student model on both labels and pseudo labels to generate final feature embeddings.
arXiv Detail & Related papers (2020-11-20T08:26:10Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps with adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z)