Investigating Data Contamination for Pre-training Language Models
- URL: http://arxiv.org/abs/2401.06059v1
- Date: Thu, 11 Jan 2024 17:24:49 GMT
- Title: Investigating Data Contamination for Pre-training Language Models
- Authors: Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang,
Jiawei Han, Sanmi Koyejo
- Abstract summary: We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data.
- Score: 46.335755305642564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models pre-trained on web-scale corpora demonstrate impressive
capabilities on diverse downstream tasks. However, there is increasing concern
whether such capabilities might arise from evaluation datasets being included
in the pre-training corpus -- a phenomenon known as \textit{data contamination}
-- in a manner that artificially increases performance. There has been little
understanding of how this potential contamination might influence LMs'
performance on downstream tasks. In this paper, we explore the impact of data
contamination at the pre-training stage by pre-training a series of GPT-2
models \textit{from scratch}. We highlight the effect of both text
contamination (\textit{i.e.}\ input text of the evaluation samples) and
ground-truth contamination (\textit{i.e.}\ the prompts asked on the input and
the desired outputs) from evaluation data. We also investigate the effects of
repeating contamination for various downstream tasks. Additionally, we examine
the prevailing n-gram-based definitions of contamination within current LLM
reports, pinpointing their limitations and inadequacy. Our findings offer new
insights into data contamination's effects on language model capabilities and
underscore the need for independent, comprehensive contamination assessments in
LLM studies.
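To make these notions concrete, the following minimal Python sketch (not the paper's code) constructs a text-contaminated and a ground-truth-contaminated pre-training document from a hypothetical evaluation sample, and then applies the kind of word-level n-gram overlap test that LLM reports commonly use to define contamination. The evaluation sample, the field names, the choice of n, and the 0.5 threshold are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch only: the evaluation sample, n, and threshold below are
# assumptions for demonstration, not the experimental setup of the paper.

def word_ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(eval_text: str, pretrain_doc: str, n: int = 8) -> float:
    """Fraction of the evaluation text's n-grams found verbatim in the document."""
    grams = word_ngrams(eval_text, n)
    if not grams:
        return 0.0
    return len(grams & word_ngrams(pretrain_doc, n)) / len(grams)

# Hypothetical evaluation sample (e.g., a question-answering item).
eval_sample = {
    "prompt": "Answer the question about European geography given below .",
    "input_text": "Which river flows through the capital city of France on its way to the sea ?",
    "label": "The Seine flows through Paris on its way to the English Channel .",
}

# Text contamination: only the input text of the evaluation sample leaks
# into the pre-training corpus.
text_contaminated_doc = "some web page text ... " + eval_sample["input_text"]

# Ground-truth contamination: the prompt and the desired output leak as well.
gt_contaminated_doc = " ".join(
    [eval_sample["prompt"], eval_sample["input_text"], eval_sample["label"]]
)

# An n-gram-based definition flags a pre-training document as contaminating an
# evaluation sample when the overlap exceeds some threshold (0.5 here).
for name, doc in [("text", text_contaminated_doc), ("ground-truth", gt_contaminated_doc)]:
    score = ngram_overlap(eval_sample["input_text"], doc, n=8)
    print(f"{name:12s} overlap={score:.2f} contaminated={score >= 0.5}")
```

Note that such an overlap test only looks at verbatim n-gram matches against the input text, which is one of the limitations the paper examines: it cannot distinguish text contamination from ground-truth contamination, and it misses paraphrased leaks.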
Related papers
- Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? [10.691754344782387]
It is difficult to define precisely which samples should be considered contaminated, and how contamination impacts benchmark scores.
We propose a novel analysis method called ConTAM and present a large-scale survey of evaluation data contamination metrics.
We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales.
arXiv: 2024-11-06
- Assessing Contamination in Large Language Models: Introducing the LogProber method [17.91379291654773]
In machine learning, contamination refers to situations where testing data leak into the training set.
In the present paper we introduce LogProber, a novel, efficient algorithm that can detect contamination using token probabilities of given sentences (a rough sketch of this probability-based probing idea appears after this list).
arXiv: 2024-08-26
- A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
arXiv: 2024-07-11
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv: 2024-05-28
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv: 2024-02-24
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv: 2024-02-23
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose NMTune, a light-weight black-box tuning method that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv: 2023-09-29
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv: 2023-05-22
- Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation occurs in some cases, while in others the models memorize the contaminated data but do not exploit it.
arXiv: 2022-03-15
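As referenced in the LogProber entry above, the rough Python sketch below shows the general idea behind probability-based contamination probing: score a benchmark sentence by its mean token log-probability under a causal LM, since abnormally high probabilities on verbatim benchmark text can hint at memorization. This is not the LogProber algorithm itself; the model choice (gpt2) and the absence of any calibrated decision threshold are simplifying assumptions.

```python
# Hedged sketch: mean token log-probability of a sentence under a causal LM.
# Illustrates the general idea of probability-based contamination probes;
# it is NOT the LogProber algorithm. Model choice and interpretation of the
# score are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_token_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given the preceding tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Unusually high per-token probability on verbatim benchmark text, relative to
# paraphrases or unseen text, can be a signal of training-set leakage.
print(mean_token_logprob("The quick brown fox jumps over the lazy dog."))
```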
This list is automatically generated from the titles and abstracts of the papers on this site.