Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
- URL: http://arxiv.org/abs/2311.06233v6
- Date: Fri, 24 May 2024 06:14:09 GMT
- Title: Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
- Authors: Shahriar Golchin, Mihai Surdeanu
- Abstract summary: We propose a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it.
We frame data contamination detection as a series of multiple-choice questions and devise a quiz format wherein three perturbed versions of each subsampled instance from a specific dataset partition are created.
Our findings suggest that DCQ achieves state-of-the-art results and uncovers greater contamination/memorization levels compared to existing methods.
- Score: 25.022166664832596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Data Contamination Quiz (DCQ), a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it. Specifically, we frame data contamination detection as a series of multiple-choice questions and devise a quiz format wherein three perturbed versions of each subsampled instance from a specific dataset partition (e.g., GSM8k test set) are created. These changes only include word-level perturbations. The generated perturbations, along with the original dataset instance, form the options in the DCQ, with an extra option accommodating the possibility of selecting none of the provided options. Given that the only distinguishing signal among the options is the exact wording with respect to the original dataset instance, an LLM, when tasked with identifying the original dataset instance, gravitates towards selecting the original one if it has been exposed to it in its pre-training phase -- a trait intrinsic to LLMs. While accounting for positional biases in LLMs, the quiz performance reveals the contamination level for the model being examined with the dataset partition to which the quiz pertains. Applied to various datasets with GPT-4 and GPT-3.5, our findings -- while fully lacking access to pre-training data and model parameters -- suggest that DCQ achieves state-of-the-art results and uncovers greater contamination/memorization levels compared to existing methods and proficiently bypasses more safety filters, especially those set to avoid generating copyrighted contents.
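The quiz construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_quiz` and `quiz_accuracy_by_slot`, the wording of the fifth option, and the `ask_model` callable (a stand-in for a call to the LLM under test) are all assumptions. Cycling the original instance through every answer position is one plausible way to expose positional bias; the abstract does not specify the exact procedure used.

```python
from collections import Counter

LETTERS = "ABCDE"
# Hypothetical wording for the "none of the provided options" choice.
NONE_OPTION = "None of the provided options."


def build_quiz(original, perturbations, original_slot):
    """Build one five-option question.

    The original instance is placed at `original_slot` (0-3), the three
    word-level perturbations fill the remaining slots, and the "none"
    option always comes last. Returns (options, correct_letter).
    """
    if len(perturbations) != 3:
        raise ValueError("expected exactly three perturbed versions")
    if not 0 <= original_slot <= 3:
        raise ValueError("original_slot must be between 0 and 3")
    options = list(perturbations)
    options.insert(original_slot, original)
    options.append(NONE_OPTION)
    return dict(zip(LETTERS, options)), LETTERS[original_slot]


def quiz_accuracy_by_slot(instances, ask_model):
    """Fraction of correct picks for each position of the original.

    `instances` is a list of (original, [p1, p2, p3]) pairs.
    `ask_model` is an assumed callable that takes the options dict and
    returns a letter; in practice it would wrap a prompt to the LLM.
    """
    if not instances:
        raise ValueError("need at least one instance")
    correct = Counter()
    for original, perturbations in instances:
        for slot in range(4):
            options, answer = build_quiz(original, perturbations, slot)
            if ask_model(options) == answer:
                correct[slot] += 1
    return [correct[slot] / len(instances) for slot in range(4)]


if __name__ == "__main__":
    # Toy stand-in for a model that has memorized one instance verbatim.
    memorized = {"Janet has 3 apples and buys 2 more."}

    def toy_model(options):
        for letter, text in options.items():
            if text in memorized:
                return letter
        return "E"

    data = [(
        "Janet has 3 apples and buys 2 more.",
        [
            "Janet owns 3 apples and buys 2 more.",
            "Janet has 3 apples and purchases 2 more.",
            "Janet has 3 apples and gets 2 more.",
        ],
    )]
    print(quiz_accuracy_by_slot(data, toy_model))
```

A model that answers from a fixed positional preference scores well only when the original happens to sit in its preferred slot, whereas a model that recognizes the exact wording scores well in every slot. Comparing the per-slot accuracies is therefore a simple way to separate the two effects.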
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs, only with unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z) - Model Selection of Zero-shot Anomaly Detectors in the Absence of Labeled Validation Data [19.919234682696306]
Anomaly detection requires detecting abnormal samples in large unlabeled datasets.
We propose SWSA: a framework to select image-based anomaly detectors with a generated synthetic validation set.
We find that SWSA often selects models that match selections made with a ground-truth validation set, resulting in higher AUROCs than baseline methods.
arXiv Detail & Related papers (2023-10-16T14:42:22Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Time Travel in LLMs: Tracing Data Contamination in Large Language Models [29.56037518816495]
We propose a straightforward yet effective method for identifying data contamination within large language models (LLMs).
At its core, our approach starts by identifying potential contamination at the instance level.
To estimate contamination of individual instances, we employ "guided instruction": a prompt consisting of the dataset name, partition type, and a random-length initial segment of a reference instance.
arXiv Detail & Related papers (2023-08-16T16:48:57Z) - Industrial Anomaly Detection with Domain Shift: A Real-world Dataset and Masked Multi-scale Reconstruction [2.921945366485149]
Industrial anomaly detection (IAD) is crucial for automating industrial quality inspection.
Existing IAD datasets focus on the diversity of data categories.
We propose the Aero-engine Blade Anomaly Detection (AeBAD) dataset, consisting of two sub-datasets.
arXiv Detail & Related papers (2023-04-05T04:07:54Z) - Unsupervised Model Selection for Time-series Anomaly Detection [7.8027110514393785]
We identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies.
We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem.
Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model.
arXiv Detail & Related papers (2022-10-03T16:49:30Z) - Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme, which selects optimal subsets of unlabeled samples with fixed batch size from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z) - Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z) - Distributed Multivariate Regression Modeling For Selecting Biomarkers Under Data Protection Constraints [0.0]
We propose a multivariable regression approach for identifying biomarkers by automatic variable selection based on aggregated data in iterative calls.
The approach can be used to jointly analyze data distributed across several locations.
In a simulation, the information loss introduced by local standardization is seen to be minimal.
arXiv Detail & Related papers (2018-03-01T15:04:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.