Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
- URL: http://arxiv.org/abs/2411.03923v1
- Date: Wed, 06 Nov 2024 13:54:08 GMT
- Title: Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
- Authors: Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, Gergely Szilvasy, Dieuwke Hupkes
- Abstract summary: It is difficult to define precisely which samples should be considered contaminated, and how it impacts benchmark scores.
We propose a novel analysis method called ConTAM and show, with a large-scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 families, that ConTAM can be used to better understand evaluation data contamination and its effects.
We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales.
- Score: 10.691754344782387
- License:
- Abstract: Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that contamination metrics can be assessed based on whether models benefit from the examples they mark contaminated. We propose a novel analysis method called ConTAM, and show with a large scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 different families that ConTAM can be used to better understand evaluation data contamination and its effects. We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales. We also find that considering only the longest contaminated substring provides a better signal than considering a union of all contaminated substrings, and that doing model and benchmark specific threshold analysis greatly increases the specificity of the results. Lastly, we investigate the impact of hyperparameter choices, finding that, among other things, both using larger values of n and disregarding matches that are infrequent in the pre-training data lead to many false negatives. With ConTAM, we provide a method to empirically ground evaluation data contamination metrics in downstream effects. With our exploration, we shed light on how evaluation data contamination can impact LLMs and provide insight into the considerations important when doing contamination analysis. We end our paper by discussing these in more detail and providing concrete suggestions for future work.
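To make the kind of metric surveyed here concrete, below is a minimal sketch of an n-gram based contamination score that supports the two aggregation choices discussed in the abstract: the longest contaminated substring versus the union of all contaminated substrings. The function names, the token-level matching, and the default n are illustrative assumptions, not ConTAM's exact definitions.

```python
# Minimal sketch of an n-gram based contamination metric, in the spirit of the
# metrics surveyed in the paper. Function names, the default n, and the scoring
# choices are illustrative assumptions, not ConTAM's exact definitions.
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of n-grams occurring in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_spans(sample: List[str],
                       corpus_ngrams: Set[Tuple[str, ...]],
                       n: int = 8) -> List[Tuple[int, int]]:
    """Merged token ranges [start, end) of the sample whose n-grams also
    occur in the pre-training corpus."""
    spans: List[Tuple[int, int]] = []
    for i in range(len(sample) - n + 1):
        if tuple(sample[i:i + n]) in corpus_ngrams:
            if spans and i < spans[-1][1]:       # overlaps previous match: extend it
                spans[-1] = (spans[-1][0], i + n)
            else:                                # otherwise start a new matched span
                spans.append((i, i + n))
    return spans


def contamination_score(sample: List[str],
                        corpus_ngrams: Set[Tuple[str, ...]],
                        n: int = 8,
                        longest_only: bool = True) -> float:
    """Fraction of sample tokens covered by corpus matches, using either only
    the longest matched span (the signal the paper finds more informative) or
    the union of all matched spans."""
    spans = contaminated_spans(sample, corpus_ngrams, n)
    if not spans:
        return 0.0
    if longest_only:
        covered = max(end - start for start, end in spans)
    else:
        covered = sum(end - start for start, end in spans)
    return covered / max(len(sample), 1)


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog and runs away".split()
    sample = "a quick brown fox jumps over the lazy dog today".split()
    corpus_ng = ngrams(corpus, 4)
    print(contamination_score(sample, corpus_ng, n=4))                      # longest span only
    print(contamination_score(sample, corpus_ng, n=4, longest_only=False))  # union of spans
```

A sample would then be marked contaminated when its score exceeds a threshold; per the abstract, that threshold is best chosen per model and benchmark, and overly large values of n or discarding matches that are infrequent in the pre-training data tend to produce false negatives.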
Related papers
- CAP: Data Contamination Detection via Consistency Amplification [20.135264289668463]
Large language models (LLMs) are widely used, but concerns about data contamination challenge their reliability.
We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage.
CAP is applicable to various benchmarks and works for both white-box and black-box models.
arXiv Detail & Related papers (2024-10-19T06:33:33Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- A Taxonomy for Data Contamination in Large Language Models [12.643103231497813]
A growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus.
Decontamination, the process of detecting and removing such data, is a potential solution.
How different types of contamination impact the performance of language models on downstream tasks is not fully understood.
arXiv Detail & Related papers (2024-07-11T17:50:34Z)
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Investigating Data Contamination for Pre-training Language Models [46.335755305642564]
We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts asked on the input and the desired outputs) from evaluation data.
arXiv Detail & Related papers (2024-01-11T17:24:49Z)
- Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data [62.56890808004615]
We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making.
We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.
arXiv Detail & Related papers (2023-12-17T00:42:42Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on that same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.