A Pretrainer's Guide to Training Data: Measuring the Effects of Data
Age, Domain Coverage, Quality, & Toxicity
- URL: http://arxiv.org/abs/2305.13169v2
- Date: Mon, 13 Nov 2023 14:50:06 GMT
- Title: A Pretrainer's Guide to Training Data: Measuring the Effects of Data
Age, Domain Coverage, Quality, & Toxicity
- Authors: Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam
Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno,
Daphne Ippolito
- Abstract summary: This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
- Score: 84.6421260559093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretraining is the preliminary and fundamental step in developing capable
language models (LM). Despite this, pretraining data design is critically
under-documented and often guided by empirically unsupported intuitions. To
address this, we pretrain 28 1.5B parameter decoder-only models, training on
data curated (1) at different times, (2) with varying toxicity and quality
filters, and (3) with different domain compositions. First, we quantify the
effect of pretraining data age. A temporal shift between evaluation data and
pretraining data leads to performance degradation, which is not overcome by
finetuning. Second, we explore the effect of quality and toxicity filters,
showing a trade-off between performance on standard benchmarks and risk of
toxic generations. Our findings indicate there does not exist a
one-size-fits-all solution to filtering training data. We also find that the
effects of different types of filtering are not predictable from text domain
characteristics. Lastly, we empirically validate that the inclusion of
heterogeneous data sources, like books and web, is broadly beneficial and
warrants greater prioritization. These findings constitute the largest set of
experiments to validate, quantify, and expose many undocumented intuitions
about text pretraining, which we hope will help support more informed
data-centric decisions in LM development.
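To make the quality/toxicity-filtering and domain-composition interventions concrete, the sketch below shows how document-level thresholds and domain mixing weights might be applied when assembling a pretraining corpus. It is a minimal illustration under assumed interfaces: the `quality_score` and `toxicity_score` callables, the threshold values, and the domain weights are hypothetical placeholders, not the filters or settings used in the paper.

```python
# Minimal illustration of threshold-based quality/toxicity filtering and
# weighted domain mixing for a pretraining corpus. The scoring functions,
# thresholds, and weights are hypothetical placeholders, not the paper's.
import random


def filter_corpus(docs, quality_score, toxicity_score,
                  min_quality=0.5, max_toxicity=0.3):
    """Keep documents whose quality is high enough and toxicity low enough."""
    return [doc for doc in docs
            if quality_score(doc) >= min_quality
            and toxicity_score(doc) <= max_toxicity]


def sample_mixture(domain_docs, domain_weights, n_docs, seed=0):
    """Sample a pretraining mixture from heterogeneous domains
    (e.g. web, books) according to per-domain weights."""
    rng = random.Random(seed)
    domains = list(domain_docs)
    weights = [domain_weights[d] for d in domains]
    mixture = []
    for _ in range(n_docs):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        mixture.append(rng.choice(domain_docs[domain]))
    return mixture


if __name__ == "__main__":
    # Toy scorers: document length as a stand-in "quality" signal and a
    # keyword flag as a stand-in "toxicity" signal.
    def quality(doc):
        return min(len(doc) / 50.0, 1.0)

    def toxicity(doc):
        return 1.0 if "badword" in doc else 0.0

    docs = {
        "web": ["a short page",
                "a much longer and cleaner web page about science"],
        "books": ["an excerpt from a long novel chapter with varied vocabulary"],
    }
    filtered = {dom: filter_corpus(ds, quality, toxicity)
                for dom, ds in docs.items()}
    corpus = sample_mixture(filtered, {"web": 0.6, "books": 0.4}, n_docs=4)
    print(corpus)
```

Tightening `min_quality` or `max_toxicity` shrinks the corpus and, per the abstract, trades benchmark performance against the risk of toxic generations; the domain weights control how much heterogeneous data such as books and web text enters the mixture.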
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates the impact of pre-training on the subsequent fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can amplify the advantage that pre-training already provides at the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
- On the Connection between Pre-training Data Diversity and Fine-tuning Robustness [66.30369048726145]
We find that the primary factor influencing downstream effective robustness is data quantity.
We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources.
arXiv Detail & Related papers (2023-07-24T05:36:19Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set, based on a non-conventional class-wise Wasserstein distance between the training and validation sets (an illustrative sketch of this distance appears after this list).
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for a model to achieve exceptional downstream performance.
We study which specific traits of the pre-training data, other than semantics, make a pre-trained LM superior to counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
Because training on such an enlarged dataset is costly, we further propose a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
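As referenced in the LAVA entry above, here is a toy illustration of a class-wise Wasserstein distance between a training set and a validation set. It is a sketch under simplifying assumptions, not LAVA's actual algorithm: each example is reduced to a single scalar feature so that SciPy's 1-D `wasserstein_distance` applies, whereas the paper works with a more general optimal-transport formulation.

```python
# Toy sketch of a class-wise Wasserstein distance between training and
# validation sets. NOT the LAVA algorithm itself: real feature spaces are
# high-dimensional and LAVA uses a more general optimal-transport formulation;
# here each example is a single scalar feature so SciPy's 1-D distance applies.
import numpy as np
from scipy.stats import wasserstein_distance


def classwise_wasserstein(train_feats, train_labels, val_feats, val_labels):
    """Average the 1-D Wasserstein distance between train and validation
    feature distributions, computed separately for each shared class label."""
    classes = np.intersect1d(np.unique(train_labels), np.unique(val_labels))
    dists = [wasserstein_distance(train_feats[train_labels == c],
                                  val_feats[val_labels == c])
             for c in classes]
    return float(np.mean(dists))


# Example: a smaller distance suggests the training set better matches the
# validation distribution, which the LAVA summary above relates to a bound on
# validation performance under Lipschitz conditions.
rng = np.random.default_rng(0)
train_feats = rng.normal(0.0, 1.0, size=500)
train_labels = rng.integers(0, 3, size=500)
val_feats = rng.normal(0.2, 1.0, size=200)
val_labels = rng.integers(0, 3, size=200)
print(classwise_wasserstein(train_feats, train_labels, val_feats, val_labels))
```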