How much can we forget about Data Contamination?
- URL: http://arxiv.org/abs/2410.03249v2
- Date: Sat, 26 Oct 2024 03:33:26 GMT
- Title: How much can we forget about Data Contamination?
- Authors: Sebastian Bordt, Suraj Srinivas, Valentyn Boreiko, Ulrike von Luxburg
- Abstract summary: Leakage of benchmark data into the training data has emerged as a significant challenge for large language models.
We use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid.
- Score: 15.893161447368273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting based on scaling along three dimensions: The number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). We find that if model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even 144 times of contamination can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. We then derive a simple theory of example forgetting via cumulative weight decay. It allows us to bound the number of gradient steps required to forget past data for any training run where we know the hyperparameters of AdamW. This indicates that many LLMs, including Llama 3, have forgotten the data seen at the beginning of training. Experimentally, we demonstrate that forgetting occurs faster than what is predicted by our bounds. Taken together, our results suggest that moderate amounts of contamination can be forgotten at the end of realistically scaled training runs.
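The forgetting bound described in the abstract rests on cumulative weight decay: with AdamW's decoupled weight decay, every later step shrinks an earlier update by a factor of (1 - lr * weight_decay). Below is a minimal sketch of that calculation; the constant learning rate, weight-decay value, and forgetting threshold are illustrative assumptions, not the paper's settings.

```python
def steps_to_forget(lr_schedule, weight_decay, eps=1e-3):
    """Number of subsequent AdamW steps until the cumulative decay factor
    prod_t (1 - lr_t * weight_decay) falls below eps, i.e. until the direct
    weight contribution of an earlier update is shrunk to a fraction eps."""
    factor = 1.0
    for step, lr in enumerate(lr_schedule, start=1):
        factor *= 1.0 - lr * weight_decay
        if factor < eps:
            return step
    return None  # not forgotten within the given schedule

# Illustrative settings (not the paper's): constant lr 3e-4, weight decay 0.1,
# and a forgetting threshold of 1e-3 -> roughly 230k steps.
constant_schedule = [3e-4] * 1_000_000
print(steps_to_forget(constant_schedule, weight_decay=0.1))
```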
Related papers
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.34900892130929]
We explore inference compute as another axis for scaling by increasing the number of generated samples.
In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance.
We find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers.
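For intuition, coverage here is the chance that at least one of k samples solves a problem. A small sketch (with an illustrative per-sample success probability) shows how it grows with repeated sampling, alongside the standard unbiased pass@k estimator used when a verifier marks c of n samples correct.

```python
from math import comb

def coverage(p, k):
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n generated samples of which a
    verifier marked c as correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: a problem solved by only 0.5% of samples reaches ~39%
# coverage at 100 samples and ~99% at 1000.
for k in (1, 100, 1000):
    print(k, round(coverage(0.005, k), 3))
```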
arXiv Detail & Related papers (2024-07-31T17:57:25Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, one line of methods replays data from previously learned tasks when learning new ones.
However, storing real data is often impractical due to memory constraints or data privacy concerns.
As a replacement, data-free replay methods synthesize samples by inverting the classification model.
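A minimal sketch of the generic inversion idea (assuming PyTorch): optimize random inputs so that a frozen copy of the old classifier labels them confidently, then replay the synthesized samples when learning the new task. The objective, step count, and regularizer are illustrative, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def invert_samples(old_model, target_class, num_samples=16,
                   shape=(3, 32, 32), steps=200, lr=0.1):
    """Synthesize pseudo-samples for `target_class` by optimizing random
    inputs so the frozen old model classifies them confidently."""
    old_model.eval()
    x = torch.randn(num_samples, *shape, requires_grad=True)
    labels = torch.full((num_samples,), target_class, dtype=torch.long)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = old_model(x)
        # Cross-entropy pulls the inputs toward the target class; a small
        # L2 penalty keeps pixel values in a reasonable range.
        loss = F.cross_entropy(logits, labels) + 1e-4 * x.pow(2).mean()
        loss.backward()
        opt.step()
    return x.detach(), labels
```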
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling [3.8330834108666667]
We present SwiftLearn, a data-efficient approach that accelerates the training of deep learning models by training on a subset of the data.
This subset is selected by an importance criterion measured over the entire dataset during the warm-up stages.
We show that almost 90% of the data can be dropped, achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop below 0.92%.
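A minimal sketch of this kind of warm-up selection (assuming PyTorch), using per-example loss as a stand-in for the paper's importance criterion and keeping the top 10% of examples:

```python
import torch

@torch.no_grad()
def select_important_subset(model, dataset, keep_frac=0.10, batch_size=256):
    """Score every example after a warm-up phase and keep the top
    `keep_frac` fraction.  Per-example loss serves as the importance
    proxy here; the paper's actual criterion may differ."""
    model.eval()
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    scores = []
    for x, y in loader:
        scores.append(loss_fn(model(x), y))
    scores = torch.cat(scores)
    k = max(1, int(keep_frac * len(scores)))
    keep_idx = torch.topk(scores, k).indices.tolist()
    return torch.utils.data.Subset(dataset, keep_idx)
```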
arXiv Detail & Related papers (2023-11-25T22:51:01Z) - Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
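The natural experiment amounts to splitting benchmarks by release date relative to the model's training cutoff and comparing scores. The toy sketch below uses made-up numbers purely to illustrate the comparison.

```python
from datetime import date
from statistics import mean

def contamination_gap(results, cutoff):
    """Split per-benchmark scores by release date relative to the model's
    training cutoff and report the mean gap.  `results` is a list of
    (release_date, score) pairs; a large pre/post gap is consistent with
    contamination, though not proof of it."""
    pre = [s for d, s in results if d < cutoff]
    post = [s for d, s in results if d >= cutoff]
    return mean(pre) - mean(post)

# Illustrative (made-up) numbers, not results from the paper:
results = [(date(2021, 3, 1), 0.71), (date(2021, 6, 1), 0.68),
           (date(2022, 5, 1), 0.62), (date(2022, 9, 1), 0.60)]
print(contamination_gap(results, cutoff=date(2021, 9, 1)))
```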
arXiv Detail & Related papers (2023-10-16T17:51:29Z) - Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs [71.56345106591789]
It is commonly believed that the weights of large language models (LLMs) contain significant redundancy.
This paper presents a counter-argument: small-magnitude weights of pre-trained models encode vital knowledge essential for tackling difficult downstream tasks.
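The operation in question is global unstructured magnitude pruning, sketched below (assuming PyTorch); the paper's argument is that zeroing exactly these small-magnitude weights removes knowledge needed for difficult downstream tasks.

```python
import torch

@torch.no_grad()
def magnitude_prune_(model, sparsity=0.5):
    """Global unstructured pruning: zero the `sparsity` fraction of weights
    with the smallest magnitude across all weight matrices."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([p.abs().flatten() for p in weights])
    k = max(1, int(sparsity * all_mags.numel()))
    threshold = all_mags.kthvalue(k).values
    for p in weights:
        p.mul_((p.abs() > threshold).to(p.dtype))
```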
arXiv Detail & Related papers (2023-09-29T22:55:06Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
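A toy calculation of diminishing returns from repeated data: in the sketch below, each additional epoch contributes exponentially less "effective" unique data. The functional form and the saturation constant are illustrative assumptions, not the paper's fitted scaling law.

```python
import math

def effective_unique_tokens(unique_tokens, epochs, saturation_epochs=15.0):
    """Toy model of diminishing returns: each extra epoch of repeated data
    adds exponentially less 'effective' unique data.  `saturation_epochs`
    is an illustrative constant, not a fitted one."""
    repeats = max(epochs - 1, 0)
    extra = saturation_epochs * (1.0 - math.exp(-repeats / saturation_epochs))
    return unique_tokens * (1.0 + extra)

# 4 epochs over 10B unique tokens ~ 37B effective (close to the naive 40B),
# while 40 epochs give only ~149B effective, far below the naive 400B.
for epochs in (1, 4, 40):
    print(epochs, round(effective_unique_tokens(10e9, epochs) / 1e9, 1))
```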
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z) - Input Perturbation Reduces Exposure Bias in Diffusion Models [41.483581603727444]
We show that a long sampling chain leads to an error accumulation phenomenon, similar to the exposure bias problem in autoregressive text generation.
We propose a very simple yet effective training regularization that perturbs the ground-truth samples to simulate inference-time prediction errors.
We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality.
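A minimal sketch of a DDPM-style training step with this regularization (assuming PyTorch and a noise-prediction model taking (x_t, t)): the noisy input is built from perturbed ground truth while the regression target remains the original noise. The perturbation placement and strength gamma follow the abstract's description and are not guaranteed to match the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ddpm_loss_with_input_perturbation(model, x0, alphas_cumprod, gamma=0.1):
    """One noise-prediction training step where the noisy input x_t is built
    from perturbed ground truth (noise eps plus gamma * extra noise), while
    the regression target stays the original eps."""
    b = x0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    eps_perturbed = eps + gamma * torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps_perturbed
    return F.mse_loss(model(x_t, t), eps)
```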
arXiv Detail & Related papers (2023-01-27T13:34:54Z) - Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation occurs in some cases, while in others the models memorize the contaminated data but do not exploit it.
arXiv Detail & Related papers (2022-03-15T20:37:16Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.