Related papers: DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning

URL: http://arxiv.org/abs/2406.04197v2
Date: Sun, 22 Sep 2024 12:40:35 GMT
Title: DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
Authors: Shangqing Tu, Kejian Zhu, Yushi Bai, Zijun Yao, Lei Hou, Juanzi Li,
Abstract summary: In-distribution contamination can inflate performance of large language models (LLMs) We propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets.
Score: 40.57095898475888
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous researches focus on detecting contamination by determining whether the model has seen the exact same data during training. Besides, prior work has already shown that even training on data similar to benchmark data inflates performance, namely \emph{In-distribution contamination}. In this work, we argue that in-distribution contamination can lead to the performance drop on OOD benchmarks. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. DICE first identifies the most sensitive layer to contamination, then trains a classifier based on the internal states of that layer. Experiments reveal DICE's high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets. We also show the generalization capability of the trained DICE detector, which is able to detect contamination across multiple benchmarks with similar distributions. Additionally, we find that DICE's predictions correlate with the performance of LLMs fine-tuned by either us or other organizations, achieving a coefficient of determination ($R^2$) between 0.61 and 0.75. The code and data are available at https://github.com/THU-KEG/DICE.

Related papers

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? [10.691754344782387]
It is difficult to define precisely which samples should be considered contaminated, and how it impacts benchmark scores. We propose a novel analysis method called ConTAM, and show with a large scale survey of evaluation data contamination metrics. We find that contamination may have a much larger effect than reported in recent LLM releases and benefits models differently at different scales.
arXiv Detail & Related papers (2024-11-06T13:54:08Z)
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions [20.51842378080194]
Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. As LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination. We systematically review 50 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated.
arXiv Detail & Related papers (2024-10-24T17:58:22Z)
CAP: Data Contamination Detection via Consistency Amplification [20.135264289668463]
Large language models (LLMs) are widely used, but concerns about data contamination challenge their reliability. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage. CAP is applicable to various benchmarks and works for both white-box and black-box models.
arXiv Detail & Related papers (2024-10-19T06:33:33Z)
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
Resultant: Incremental Effectiveness on Likelihood for Unsupervised Out-of-Distribution Detection [63.93728560200819]
Unsupervised out-of-distribution (U-OOD) detection is to identify data samples with a detector trained solely on unlabeled in-distribution (ID) data. Recent studies have developed various detectors based on DGMs to move beyond likelihood. We apply two techniques for each direction, specifically post-hoc prior and dataset entropy-mutual calibration. Experimental results demonstrate that the Resultant could be a new state-of-the-art U-OOD detector.
arXiv Detail & Related papers (2024-09-05T02:58:13Z)
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance. We propose Inference-Time Decontamination (ITD) to address this issue. ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
Data Contamination Can Cross Language Barriers [29.103517721155487]
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. We first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods. We propose generalization-based approaches to unmask such deeply concealed contamination.
arXiv Detail & Related papers (2024-06-19T05:53:27Z)
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
Data Contamination Through the Lens of Time [21.933771085956426]
Large language models (LLMs) are often supported by evaluating publicly available benchmarks. This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)
Hierarchical Semi-Supervised Contrastive Learning for Contamination-Resistant Anomaly Detection [81.07346419422605]
Anomaly detection aims at identifying deviant samples from the normal data distribution. Contrastive learning has provided a successful way to sample representation that enables effective discrimination on anomalies. We propose a novel hierarchical semi-supervised contrastive learning framework, for contamination-resistant anomaly detection.
arXiv Detail & Related papers (2022-07-24T18:49:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.