Investigating Data Contamination for Pre-training Language Models
- URL: http://arxiv.org/abs/2401.06059v1
- Date: Thu, 11 Jan 2024 17:24:49 GMT
- Title: Investigating Data Contamination for Pre-training Language Models
- Authors: Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang,
Jiawei Han, Sanmi Koyejo
- Abstract summary: We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data.
- Score: 46.335755305642564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models pre-trained on web-scale corpora demonstrate impressive
capabilities on diverse downstream tasks. However, there is increasing concern
whether such capabilities might arise from evaluation datasets being included
in the pre-training corpus -- a phenomenon known as \textit{data contamination}
-- in a manner that artificially increases performance. There has been little
understanding of how this potential contamination might influence LMs'
performance on downstream tasks. In this paper, we explore the impact of data
contamination at the pre-training stage by pre-training a series of GPT-2
models \textit{from scratch}. We highlight the effect of both text
contamination (\textit{i.e.}\ input text of the evaluation samples) and
ground-truth contamination (\textit{i.e.}\ the prompts asked on the input and
the desired outputs) from evaluation data. We also investigate the effects of
repeating contamination for various downstream tasks. Additionally, we examine
the prevailing n-gram-based definitions of contamination within current LLM
reports, pinpointing their limitations and inadequacy. Our findings offer new
insights into data contamination's effects on language model capabilities and
underscore the need for independent, comprehensive contamination assessments in
LLM studies.
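To make these notions concrete, the following minimal Python sketch (not the paper's code) constructs a text-contaminated and a ground-truth-contaminated pre-training document from a hypothetical evaluation sample, and then applies the kind of word-level n-gram overlap test that LLM reports commonly use to define contamination. The evaluation sample, the field names, the choice of n, and the 0.5 threshold are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch only: the evaluation sample, n, and threshold below are
# assumptions for demonstration, not the experimental setup of the paper.

def word_ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(eval_text: str, pretrain_doc: str, n: int = 8) -> float:
    """Fraction of the evaluation text's n-grams found verbatim in the document."""
    grams = word_ngrams(eval_text, n)
    if not grams:
        return 0.0
    return len(grams & word_ngrams(pretrain_doc, n)) / len(grams)

# Hypothetical evaluation sample (e.g., a question-answering item).
eval_sample = {
    "prompt": "Answer the question about European geography given below .",
    "input_text": "Which river flows through the capital city of France on its way to the sea ?",
    "label": "The Seine flows through Paris on its way to the English Channel .",
}

# Text contamination: only the input text of the evaluation sample leaks
# into the pre-training corpus.
text_contaminated_doc = "some web page text ... " + eval_sample["input_text"]

# Ground-truth contamination: the prompt and the desired output leak as well.
gt_contaminated_doc = " ".join(
    [eval_sample["prompt"], eval_sample["input_text"], eval_sample["label"]]
)

# An n-gram-based definition flags a pre-training document as contaminating an
# evaluation sample when the overlap exceeds some threshold (0.5 here).
for name, doc in [("text", text_contaminated_doc), ("ground-truth", gt_contaminated_doc)]:
    score = ngram_overlap(eval_sample["input_text"], doc, n=8)
    print(f"{name:12s} overlap={score:.2f} contaminated={score >= 0.5}")
```

Note that such an overlap test only looks at verbatim n-gram matches against the input text, which is one of the limitations the paper examines: it cannot distinguish text contamination from ground-truth contamination, and it misses paraphrased leaks.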
Related papers
- Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? [10.691754344782387]
It is difficult to define precisely which samples should be considered contaminated, and how contamination impacts benchmark scores.
We propose a novel analysis method called ConTAM and present a large-scale survey of evaluation data contamination metrics.
We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales.
arXiv: 2024-11-06
- Assessing Contamination in Large Language Models: Introducing the LogProber method [17.91379291654773]
In machine learning, contamination refers to situations where testing data leak into the training set.
In the present paper we introduce LogProber, a novel, efficient algorithm that can detect contamination using token probabilities of given sentences (a rough sketch of this probability-based probing idea appears after this list).
arXiv: 2024-08-26
- A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
arXiv: 2024-07-11
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv: 2024-05-28
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv: 2024-02-24
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv: 2024-02-23
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose NMTune, a light-weight black-box tuning method that applies an affine transformation to the feature space to mitigate the malignant effect of noise.
arXiv: 2023-09-29
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv: 2023-05-22
- Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation occurs in some cases, while in others the models memorize the contaminated data but do not exploit it.
arXiv: 2022-03-15
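As referenced in the LogProber entry above, the rough Python sketch below shows the general idea behind probability-based contamination probing: score a benchmark sentence by its mean token log-probability under a causal LM, since abnormally high probabilities on verbatim benchmark text can hint at memorization. This is not the LogProber algorithm itself; the model choice (gpt2) and the absence of any calibrated decision threshold are simplifying assumptions.

```python
# Hedged sketch: mean token log-probability of a sentence under a causal LM.
# Illustrates the general idea of probability-based contamination probes;
# it is NOT the LogProber algorithm. Model choice and interpretation of the
# score are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_token_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each token given the preceding tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Unusually high per-token probability on verbatim benchmark text, relative to
# paraphrases or unseen text, can be a signal of training-set leakage.
print(mean_token_logprob("The quick brown fox jumps over the lazy dog."))
```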
This list is automatically generated from the titles and abstracts of the papers on this site.