Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
- URL: http://arxiv.org/abs/2402.15938v3
- Date: Fri, 31 May 2024 17:49:03 GMT
- Title: Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
- Authors: Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, Ge Li
- Abstract summary: CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
- Score: 42.958880063727996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent claims about the impressive capabilities of large language models (LLMs) are usually supported by evaluation on open-access benchmarks. Given the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leaving LLMs susceptible to data contamination. However, due to the opacity of training data, the black-box access to models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only the sampled texts to detect data contamination, by identifying the peakedness of the LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, which is based on correcting the LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for the data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC, and can effectively detect implicit contamination. TED substantially mitigates performance inflation of up to 66.9% attributable to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to have suffered data contamination on the HumanEval benchmark.
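The abstract sketches the core of CDD: sample several completions for each benchmark prompt and flag contamination when the output distribution is sharply peaked around a single completion. Below is a minimal, hypothetical illustration of that idea in Python; it is not the authors' released implementation, and the edit-distance peakedness statistic, the length normalization, and the thresholds `xi` and `tau` are illustrative assumptions rather than the paper's hyperparameters.

```python
# Hypothetical sketch of CDD-style contamination detection via output-
# distribution peakedness. Assumptions (not from the paper's code): a
# greedy completion plus k temperature-sampled completions per prompt,
# length-normalized edit distance, and thresholds xi / tau.
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]


def peakedness(greedy: str, samples: List[str], xi: float = 0.05) -> float:
    """Fraction of sampled outputs that are near-duplicates of the greedy
    output, where 'near' means edit distance below xi times its length."""
    if not samples:
        return 0.0
    near = sum(edit_distance(greedy, s) <= xi * max(len(greedy), 1)
               for s in samples)
    return near / len(samples)


def is_contaminated(greedy: str, samples: List[str],
                    xi: float = 0.05, tau: float = 0.2) -> bool:
    """Flag a prompt as likely contaminated when the output distribution
    concentrates around one completion (peakedness above tau)."""
    return peakedness(greedy, samples, xi) > tau
```

Under this sketch, each benchmark example costs one greedy query plus k sampled queries; a TED-style correction could then, analogously, drop or down-weight the near-duplicate samples before computing benchmark metrics.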
Related papers
- DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning [40.57095898475888]
We argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity.
To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination.
Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets.
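Based only on the summary above, a DICE-style "locate-then-detect" procedure might first pick the layer whose hidden states best separate contaminated from clean examples and then reuse a classifier trained on that layer. The sketch below is a hedged guess at such a probe; the logistic-regression classifier, the layer-selection criterion, and the array shapes are all assumptions, not the paper's method.

```python
# Hedged sketch of a "locate-then-detect" probe over internal states,
# inferred from the DICE summary above; details are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def locate_then_detect(hidden_states: np.ndarray, labels: np.ndarray):
    """Locate the most discriminative layer, then return its probe.

    hidden_states: (n_layers, n_examples, hidden_dim) representations of
        benchmark examples; labels: 1 = known-contaminated, 0 = clean.
    """
    n_layers, n_examples, _ = hidden_states.shape
    split = n_examples // 2          # simple train/validation split
    best = (-1, 0.0, None)           # (layer, validation accuracy, probe)
    for layer in range(n_layers):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(hidden_states[layer][:split], labels[:split])
        acc = probe.score(hidden_states[layer][split:], labels[split:])
        if acc > best[1]:
            best = (layer, acc, probe)
    layer, _, probe = best
    return layer, probe              # probe.predict scores unseen examples
```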
arXiv Detail & Related papers (2024-06-06T15:55:53Z) - How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming critical.
Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars.
It is becoming harder and harder to keep track of the data that LLMs have seen, if not impossible, as closed-source models like GPT-4 and Claude-3 divulge no information about their training sets.
arXiv Detail & Related papers (2024-03-31T14:32:02Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates, for the first time, an LLM-powered "interactor" role to achieve a dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - Investigating Data Contamination for Pre-training Language Models [46.335755305642564]
We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts paired with the desired outputs) from the evaluation data.
arXiv Detail & Related papers (2024-01-11T17:24:49Z) - Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data [62.56890808004615]
We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making.
We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.
arXiv Detail & Related papers (2023-12-17T00:42:42Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on that same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z) - Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)