Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
- URL: http://arxiv.org/abs/2402.15938v3
- Date: Fri, 31 May 2024 17:49:03 GMT
- Title: Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
- Authors: Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, Ge Li
- Abstract summary: CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
- Score: 42.958880063727996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent claims about the impressive capabilities of large language models (LLMs) are usually supported by evaluation on open-access benchmarks. Given the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leaving LLMs susceptible to data contamination. However, due to the opacity of training data, the black-box access to models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD requires only the sampled texts to detect data contamination, by identifying the peakedness of the LLM's output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, which is based on correcting the LLM's output distribution. To facilitate this study, we introduce two benchmarks, i.e., DetCon and ComiEval, for the data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC, and can effectively detect implicit contamination. TED substantially mitigates performance inflation of up to 66.9% attributable to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to have suffered data contamination on the HumanEval benchmark.
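The abstract sketches the core of CDD: sample several completions for each benchmark prompt and flag contamination when the output distribution is sharply peaked around a single completion. Below is a minimal, hypothetical illustration of that idea in Python; it is not the authors' released implementation, and the edit-distance peakedness statistic, the length normalization, and the thresholds `xi` and `tau` are illustrative assumptions rather than the paper's hyperparameters.

```python
# Hypothetical sketch of CDD-style contamination detection via output-
# distribution peakedness. Assumptions (not from the paper's code): a
# greedy completion plus k temperature-sampled completions per prompt,
# length-normalized edit distance, and thresholds xi / tau.
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]


def peakedness(greedy: str, samples: List[str], xi: float = 0.05) -> float:
    """Fraction of sampled outputs that are near-duplicates of the greedy
    output, where 'near' means edit distance below xi times its length."""
    if not samples:
        return 0.0
    near = sum(edit_distance(greedy, s) <= xi * max(len(greedy), 1)
               for s in samples)
    return near / len(samples)


def is_contaminated(greedy: str, samples: List[str],
                    xi: float = 0.05, tau: float = 0.2) -> bool:
    """Flag a prompt as likely contaminated when the output distribution
    concentrates around one completion (peakedness above tau)."""
    return peakedness(greedy, samples, xi) > tau
```

Under this sketch, each benchmark example costs one greedy query plus k sampled queries; a TED-style correction could then, analogously, drop or down-weight the near-duplicate samples before computing benchmark metrics.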
Related papers
- DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning [40.57095898475888]
We argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity.
To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination.
Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets.
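Based only on the summary above, a DICE-style "locate-then-detect" procedure might first pick the layer whose hidden states best separate contaminated from clean examples and then reuse a classifier trained on that layer. The sketch below is a hedged guess at such a probe; the logistic-regression classifier, the layer-selection criterion, and the array shapes are all assumptions, not the paper's method.

```python
# Hedged sketch of a "locate-then-detect" probe over internal states,
# inferred from the DICE summary above; details are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression


def locate_then_detect(hidden_states: np.ndarray, labels: np.ndarray):
    """Locate the most discriminative layer, then return its probe.

    hidden_states: (n_layers, n_examples, hidden_dim) representations of
        benchmark examples; labels: 1 = known-contaminated, 0 = clean.
    """
    n_layers, n_examples, _ = hidden_states.shape
    split = n_examples // 2          # simple train/validation split
    best = (-1, 0.0, None)           # (layer, validation accuracy, probe)
    for layer in range(n_layers):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(hidden_states[layer][:split], labels[:split])
        acc = probe.score(hidden_states[layer][split:], labels[split:])
        if acc > best[1]:
            best = (layer, acc, probe)
    layer, _, probe = best
    return layer, probe              # probe.predict scores unseen examples
```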
arXiv Detail & Related papers (2024-06-06T15:55:53Z) - How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
With the rise of Large Language Models (LLMs) in recent years, new opportunities are emerging, but also new challenges, and contamination is quickly becoming critical.
Business applications and fundraising in AI have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into tens of millions of dollars.
It is becoming harder and harder to keep track of the data that LLMs have seen, if not impossible, as closed-source models like GPT-4 and Claude-3 divulge no information about their training sets.
arXiv Detail & Related papers (2024-03-31T14:32:02Z) - KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates, for the first time, an LLM-powered "interactor" role to achieve a dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z) - Investigating Data Contamination for Pre-training Language Models [46.335755305642564]
We explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models.
We highlight the effect of both text contamination (i.e., the input text of the evaluation samples) and ground-truth contamination (i.e., the prompts paired with the desired outputs) from the evaluation data.
arXiv Detail & Related papers (2024-01-11T17:24:49Z) - Interpretable Causal Inference for Analyzing Wearable, Sensor, and Distributional Data [62.56890808004615]
We develop an interpretable method for distributional data analysis that ensures trustworthy and robust decision-making.
We demonstrate ADD MALTS' utility by studying the effectiveness of continuous glucose monitors in mitigating diabetes risks.
arXiv Detail & Related papers (2023-12-17T00:42:42Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on that same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z) - Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)