Deduplicating Training Data Makes Language Models Better
- URL: http://arxiv.org/abs/2107.06499v1
- Date: Wed, 14 Jul 2021 06:06:52 GMT
- Title: Deduplicating Training Data Makes Language Models Better
- Authors: Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas
Eck, Chris Callison-Burch, Nicholas Carlini
- Abstract summary: Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
We develop two tools that allow us to deduplicate training datasets.
- Score: 50.22588162039083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We find that existing language modeling datasets contain many near-duplicate
examples and long repetitive substrings. As a result, over 1% of the unprompted
output of language models trained on these datasets is copied verbatim from the
training data. We develop two tools that allow us to deduplicate training
datasets -- for example removing from C4 a single 61-word English sentence that
is repeated over 60,000 times. Deduplication allows us to train models that
emit memorized text ten times less frequently and require fewer train steps to
achieve the same or better accuracy. We can also reduce train-test overlap,
which affects over 4% of the validation set of standard datasets, thus allowing
for more accurate evaluation. We release code for reproducing our work and
performing dataset deduplication at
https://github.com/google-research/deduplicate-text-datasets.
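The paper's two tools pair exact substring matching (built on suffix arrays) with approximate, MinHash-based near-duplicate detection. As a rough illustration only, the following is a minimal, self-contained Python sketch of MinHash over word 5-grams; the function names, shingle size, hash count, and 0.8 threshold are assumptions for the example, not the released implementation linked above.

```python
# Minimal sketch (not the released tool) of MinHash-based near-duplicate
# detection: documents whose estimated Jaccard similarity over word 5-grams
# exceeds a threshold are flagged as near-duplicates.
import hashlib
from itertools import combinations

def shingles(text, n=5):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=128):
    # One value per seed: the minimum hash over all shingles.
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching minima approximates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_duplicate_pairs(docs, threshold=0.8):
    sigs = [minhash_signature(shingles(d)) for d in docs]
    return [(i, j) for i, j in combinations(range(len(docs)), 2)
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold]

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "the quick brown fox jumps over the lazy dog near the river bank today",
        "a completely different training example about language model datasets",
    ]
    print(near_duplicate_pairs(corpus))  # typically prints [(0, 1)]
```

A production pipeline would bucket signatures with locality-sensitive hashing rather than comparing every pair, so the quadratic loop here is purely for readability.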
Related papers
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Constructing Multilingual Code Search Dataset Using Neural Machine Translation [48.32329232202801]
We create a multilingual code search dataset in four natural and four programming languages.
Our results show that the model pre-trained with all natural and programming language data has performed best in most cases.
arXiv Detail & Related papers (2023-06-27T16:42:36Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- SemDeDup: Data-efficient learning at web-scale through semantic deduplication [34.38272674518666]
We introduce SemDeDup, a method that leverages embeddings from pre-trained models to identify and remove semantic duplicates (see the sketch after this list).
We show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time.
Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains.
arXiv Detail & Related papers (2023-03-16T17:53:24Z)
- Scaling Laws and Interpretability of Learning from Repeated Data [4.3242395495523525]
We train a family of models where most of the data is unique but a small fraction of it is repeated many times.
We find a strong double descent phenomenon, in which repeated data can cause test loss to increase midway through training.
A predictable range of repetition frequency leads to surprisingly severe degradation in performance.
arXiv Detail & Related papers (2022-05-21T02:14:27Z)
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)
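Picking up the forward reference in the SemDeDup entry above, here is a minimal sketch, under assumed inputs, of within-cluster semantic deduplication: given document embeddings and cluster assignments, any group of items whose pairwise cosine similarity exceeds a threshold is collapsed to a single representative. The random embeddings, single cluster, and 0.95 threshold are stand-ins for illustration; the original method builds on embeddings from pre-trained models and clusters the corpus before comparing pairs.

```python
# Minimal sketch of the semantic-deduplication idea: within each cluster,
# keep one representative of every group of highly similar embeddings.
import numpy as np

def cosine_matrix(X):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def semantic_dedup(embeddings, cluster_ids, threshold=0.95):
    """Return indices of the examples to keep."""
    keep = []
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        sims = cosine_matrix(embeddings[idx])
        removed = set()
        for a in range(len(idx)):
            if a in removed:
                continue
            keep.append(int(idx[a]))
            # Drop everything in the cluster that is too similar to a kept item.
            removed.update(
                b for b in range(a + 1, len(idx)) if sims[a, b] >= threshold
            )
    return sorted(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 16))
    emb[3] = emb[0] + 1e-3 * rng.normal(size=16)  # near-duplicate of item 0
    clusters = np.zeros(6, dtype=int)             # one cluster for the demo
    print(semantic_dedup(emb, clusters))          # item 3 should be dropped
```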