Related papers: Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

URL: http://arxiv.org/abs/2409.09927v1
Date: Mon, 16 Sep 2024 02:04:33 GMT
Title: Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges
Authors: Vinay Samuel, Yue Zhou, Henry Peng Zou,
Abstract summary: We evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets. Our analysis reveals that current methods have non-trivial limitations in their assumptions and practical applications.
Score: 3.0455427910850785
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models achieve increasingly impressive results, questions arise about whether such performance is from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address this gap and provide a dual investigation of SOTA LLM contamination status and detection method robustness, we evaluate five contamination detection approaches with four state-of-the-art LLMs across eight challenging datasets often used in modern LLM evaluation. Our analysis reveals that (1) Current methods have non-trivial limitations in their assumptions and practical applications; (2) Notable difficulties exist in detecting contamination introduced during instruction fine-tuning with answer augmentation; and (3) Limited consistencies between SOTA contamination detection techniques. These findings highlight the complexity of contamination detection in advanced LLMs and the urgent need for further research on robust and generalizable contamination evaluation. Our code is available at https://github.com/vsamuel2003/data-contamination.

Related papers

Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks.<n>Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage.<n>We show that existing detection approaches either fail outright or exhibit inconsistent behavior.<n>We propose a novel simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z)
RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models [0.8739101659113157]
Residual-Noise Fingerprinting (RN-F) is a novel framework for detecting contaminated data in Large Language Models (LLMs)<n>RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations.<n>We show that RN-F consistently outperforms existing state-of-the-art methods, achieving performance improvements of up to 10.5% in contamination detection metrics.
arXiv Detail & Related papers (2025-05-19T15:32:49Z)
A Survey on Data Contamination for Large Language Models [12.431575579432458]
Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. The reliability of performance evaluation has come under scrutiny due to data contamination.
arXiv Detail & Related papers (2025-02-20T10:23:27Z)
Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge. We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z)
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination [18.586654412992168]
multimodal large language models (MLLMs) have demonstrated superior performance on various multimodal benchmarks. The issue of data contamination during training creates challenges in performance evaluation and comparison. We introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs.
arXiv Detail & Related papers (2024-11-06T10:44:15Z)
CAP: Data Contamination Detection via Consistency Amplification [20.135264289668463]
Large language models (LLMs) are widely used, but concerns about data contamination challenge their reliability. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage. CAP is applicable to various benchmarks and works for both white-box and black-box models.
arXiv Detail & Related papers (2024-10-19T06:33:33Z)
LogProber: Disentangling confidence from contamination in LLM responses [17.91379291654773]
In machine learning, contamination refers to situations where testing data leak into the training set.<n>We introduce LogProber, a novel, efficient algorithm that we show to be able to detect contamination in a black box setting.
arXiv Detail & Related papers (2024-08-26T15:29:34Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors. We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and fundraising in AI. LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models. It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks. We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.