Scaling Laws and Interpretability of Learning from Repeated Data
- URL: http://arxiv.org/abs/2205.10487v1
- Date: Sat, 21 May 2022 02:14:27 GMT
- Title: Scaling Laws and Interpretability of Learning from Repeated Data
- Authors: Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain,
Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan
Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei,
Nicholas Joseph, Jared Kaplan and Sam McCandlish
- Abstract summary: We train a family of models where most of the data is unique but a small fraction of it is repeated many times.
We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training.
A predictable range of repetition frequency leads to surprisingly severe degradation in performance.
- Score: 4.3242395495523525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models have been trained on vast datasets, but also
often on repeated data, either intentionally for the purpose of upweighting
higher quality data, or unintentionally because data deduplication is not
perfect and the model is exposed to repeated data at the sentence, paragraph,
or document level. Some works have reported substantial negative performance
effects of this repeated data. In this paper we attempt to study repeated data
systematically and to understand its effects mechanistically. To do this, we
train a family of models where most of the data is unique but a small fraction
of it is repeated many times. We find a strong double descent phenomenon, in
which repeated data can lead test loss to increase midway through training. A
predictable range of repetition frequency leads to surprisingly severe
degradation in performance. For instance, performance of an 800M parameter
model can be degraded to that of a 2x smaller model (400M params) by repeating
0.1% of the data 100 times, despite the other 90% of the training tokens
remaining unique. We suspect there is a range in the middle where the data can
be memorized and doing so consumes a large fraction of the model's capacity,
and this may be where the peak of degradation occurs. Finally, we connect these
observations to recent mechanistic interpretability work - attempting to
reverse engineer the detailed computations performed by the model - by showing
that data repetition disproportionately damages copying and internal structures
associated with generalization, such as induction heads, providing a possible
mechanism for the shift from generalization to memorization. Taken together,
these results provide a hypothesis for why repeating a relatively small
fraction of data in large language models could lead to disproportionately
large harms to performance.
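As a concrete illustration of the experimental setup described above (a sketch under stated assumptions, not the authors' actual data pipeline), the snippet below uses a hypothetical build_repeated_mixture helper that holds out a tiny slice of a corpus and interleaves it many times, so that repeating 0.1% of the data 100 times leaves roughly 90% of the training tokens unique.

```python
import random

def build_repeated_mixture(unique_docs, repeated_frac=0.001, n_repeats=100, seed=0):
    """Sketch of a training mixture where most data is unique but a small
    fraction is repeated many times (illustrative; not the paper's exact pipeline)."""
    rng = random.Random(seed)
    docs = list(unique_docs)
    rng.shuffle(docs)

    n_repeated = max(1, int(len(docs) * repeated_frac))
    repeated_subset = docs[:n_repeated]   # tiny slice the model will see many times
    unique_rest = docs[n_repeated:]       # the bulk of the corpus, seen once

    mixture = unique_rest + repeated_subset * n_repeats
    rng.shuffle(mixture)                  # spread the repeats throughout training
    return mixture

# Toy example: 10,000 documents; 10 of them (0.1%) are each repeated 100 times,
# so repeats make up ~10% of training examples and ~90% of tokens remain unique.
corpus = [f"doc-{i}" for i in range(10_000)]
mixture = build_repeated_mixture(corpus)
print(len(mixture))   # 10,990 = 9,990 unique + 10 * 100 repeated
```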
Related papers
- Strong Model Collapse [16.071600606637908]
We consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon.
Our results show that even the smallest fraction of synthetic data can lead to model collapse.
We investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse.
arXiv Detail & Related papers (2024-10-07T08:54:23Z) - Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences [20.629333587044012]
We study the impact of data curation on iterated retraining of generative models.
We prove that, if the data is curated according to a reward model, the expected reward of the iterative retraining procedure is maximized.
arXiv Detail & Related papers (2024-06-12T21:28:28Z) - Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data [49.73114504515852]
We show that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse.
We demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse.
arXiv Detail & Related papers (2024-04-01T18:31:24Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST^EM scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [91.14291142262262]
This work presents a straightforward and fundamental explanation from the data perspective.
Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data.
Our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
arXiv Detail & Related papers (2023-10-16T09:35:42Z) - Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters (an illustrative sketch of this diminishing-returns idea appears after this list).
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - Reminding the Incremental Language Model via Data-Free Self-Distillation [26.960750314663294]
Incremental language learning with pseudo-data can alleviate catastrophic forgetting in neural networks.
We propose reminding the incremental language model via data-free self-distillation (DFSD).
Our DFSD can exceed the previous state-of-the-art methods even when the amount of pseudo-data is reduced by as much as 90%.
arXiv Detail & Related papers (2021-10-17T07:27:43Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
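The "Scaling Data-Constrained Language Models" entry above proposes a scaling law in which repeated tokens contribute progressively less than fresh ones. The sketch below illustrates that diminishing-returns idea with an exponential-saturation form; both the functional form and the half_value_epochs constant are assumptions for illustration, not the law fitted in that paper.

```python
import math

def effective_unique_tokens(unique_tokens, n_epochs, half_value_epochs=15.0):
    """Illustrative diminishing-returns model for repeated data (an assumption,
    not the cited paper's exact law): each additional epoch over the same unique
    tokens is worth less than the last, saturating exponentially."""
    extra_epochs = max(0.0, n_epochs - 1)
    # Value of the repeats decays toward a ceiling of half_value_epochs "free" epochs.
    gained = half_value_epochs * (1 - math.exp(-extra_epochs / half_value_epochs))
    return unique_tokens * (1 + gained)

# A few epochs of repetition behave almost like fresh data; many epochs do not.
for epochs in (1, 4, 16, 64):
    print(epochs, round(effective_unique_tokens(1e9, epochs) / 1e9, 2))
```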