Downstream Datasets Make Surprisingly Good Pretraining Corpora
- URL: http://arxiv.org/abs/2209.14389v2
- Date: Fri, 26 May 2023 13:46:47 GMT
- Title: Downstream Datasets Make Surprisingly Good Pretraining Corpora
- Authors: Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton
- Abstract summary: This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning.
In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus.
Our results hint that performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts.
- Score: 39.77171117174906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For most natural language processing tasks, the dominant practice is to
finetune large pretrained transformer models (e.g., BERT) using smaller
downstream datasets. Despite the success of this approach, it remains unclear
to what extent these gains are attributable to the massive background corpora
employed for pretraining versus to the pretraining objectives themselves. This
paper introduces a large-scale study of self-pretraining, where the same
(downstream) training data is used for both pretraining and finetuning. In
experiments addressing both ELECTRA and RoBERTa models and 10 distinct
downstream classification datasets, we observe that self-pretraining rivals
standard pretraining on the BookWiki corpus (despite using around
$10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$
datasets, respectively. Surprisingly, these task-specific pretrained models
often perform well on other tasks, including the GLUE benchmark. Besides
classification tasks, self-pretraining also provides benefits on structured
output prediction tasks such as span based question answering and commonsense
inference, often providing more than $50\%$ of the performance boosts provided
by pretraining on the BookWiki corpus. Our results hint that in many scenarios,
performance gains attributable to pretraining are driven primarily by the
pretraining objective itself and are not always attributable to the use of
external pretraining data in massive amounts. These findings are especially
relevant in light of concerns about intellectual property and offensive content
in web-scale pretraining data.
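The pipeline described in the abstract can be made concrete with a minimal sketch, assuming the HuggingFace transformers and datasets libraries. The dataset (ag_news), epoch counts, and batch sizes below are illustrative placeholders rather than the paper's setup (the paper trains ELECTRA and RoBERTa from scratch with their own recipes): the model is pretrained with a masked-LM objective on the downstream training text only, then finetuned on the same labeled examples.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Stand-in downstream classification dataset (the paper uses 10 of them).
dataset = load_dataset("ag_news")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

# Phase 1: self-pretraining. A *randomly initialized* RoBERTa-style model is
# trained with the masked-LM objective on the downstream text only -- no
# external corpus such as BookWiki is used.
mlm_model = RobertaForMaskedLM(RobertaConfig())  # fresh weights, roberta-base sized
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="self-pretrain-mlm", num_train_epochs=10,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"].remove_columns(["text", "label"]),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("self-pretrain-mlm")

# Phase 2: finetune the self-pretrained encoder on the same labeled data.
classifier = AutoModelForSequenceClassification.from_pretrained("self-pretrain-mlm", num_labels=4)
Trainer(
    model=classifier,
    args=TrainingArguments(output_dir="self-pretrain-clf", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
).train()
```
The baseline the paper compares against differs only in Phase 1: it starts from a model pretrained on the external BookWiki corpus instead of the downstream text.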
Related papers
- Understanding and Mitigating the Label Noise in Pre-training on
Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a lightweight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the harmful effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z) - When Less is More: Investigating Data Pruning for Pretraining LLMs at
Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher-quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z) - Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
arXiv Detail & Related papers (2023-06-08T15:26:52Z) - SEPT: Towards Scalable and Efficient Visual Pre-Training [11.345844145289524]
Self-supervised pre-training has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance.
We build a task-specific self-supervised pre-training framework based on a simple hypothesis that pre-training on the unlabeled samples with similar distribution to the target task can bring substantial performance gains.
arXiv Detail & Related papers (2022-12-11T11:02:11Z) - Improved Fine-tuning by Leveraging Pre-training Data: Theory and
Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterpart trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z) - Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve over standard ImageNet pre-training by 1-3% by tuning available models on our subsets and by pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z) - Rethinking Pre-training and Self-training [105.27954735761678]
We investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training.
Our study reveals the generality and flexibility of self-training with three additional insights.
For example, on the COCO object detection dataset, pre-training helps when we use one-fifth of the labeled data but hurts accuracy when we use all of the labeled data. A generic pseudo-labeling sketch of the self-training loop appears after this list.
arXiv Detail & Related papers (2020-06-11T23:59:16Z)
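As referenced in the last entry above, self-training generally means training on a model's own confident predictions for unlabeled data. Below is an illustrative sketch of that generic pseudo-labeling loop using a toy scikit-learn classifier; the synthetic data, the 0.95 confidence threshold, and the number of rounds are assumptions for illustration, not the cited paper's object-detection setup.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: a small labeled set plus a larger pool of "extra" unlabeled data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, y_lab = X[:200], y[:200]
X_unlab = X[200:]

# Teacher: fit on the labeled data only.
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

for _ in range(3):  # a few self-training rounds
    probs = model.predict_proba(X_unlab)
    conf = probs.max(axis=1)
    keep = conf > 0.95                      # keep only confident pseudo-labels
    pseudo_X = X_unlab[keep]
    pseudo_y = probs[keep].argmax(axis=1)
    # Student: retrain on labeled + pseudo-labeled data.
    model = LogisticRegression(max_iter=1000).fit(
        np.concatenate([X_lab, pseudo_X]),
        np.concatenate([y_lab, pseudo_y]),
    )
```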
This list is automatically generated from the titles and abstracts of the papers on this site.