Related papers: Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models

URL: http://arxiv.org/abs/2311.06233v6
Date: Fri, 24 May 2024 06:14:09 GMT
Title: Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
Authors: Shahriar Golchin, Mihai Surdeanu
Abstract summary: We propose a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it. We frame data contamination detection as a series of multiple-choice questions and devise a quiz format wherein three perturbed versions of each subsampled instance from a specific dataset partition are created. Our findings suggest that DCQ achieves state-of-the-art results and uncovers greater contamination/memorization levels compared to existing methods.
Score: 25.022166664832596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose the Data Contamination Quiz (DCQ), a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it. Specifically, we frame data contamination detection as a series of multiple-choice questions and devise a quiz format wherein three perturbed versions of each subsampled instance from a specific dataset partition (e.g., GSM8k test set) are created. These changes only include word-level perturbations. The generated perturbations, along with the original dataset instance, form the options in the DCQ, with an extra option accommodating the possibility of selecting none of the provided options. Given that the only distinguishing signal among the options is the exact wording with respect to the original dataset instance, an LLM, when tasked with identifying the original dataset instance, gravitates towards selecting the original one if it has been exposed to it in its pre-training phase -- a trait intrinsic to LLMs. While accounting for positional biases in LLMs, the quiz performance reveals the contamination level for the model being examined with the dataset partition to which the quiz pertains. Applied to various datasets with GPT-4 and GPT-3.5, our findings -- while fully lacking access to pre-training data and model parameters -- suggest that DCQ achieves state-of-the-art results and uncovers greater contamination/memorization levels compared to existing methods and proficiently bypasses more safety filters, especially those set to avoid generating copyrighted contents.
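
The quiz format described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_quiz`, `estimate_contamination`, and `choose` are hypothetical, the three perturbed versions are supplied by the caller rather than generated, and rotating the original instance through every option position is only one possible way to account for positional bias. The abstract does not specify the exact estimator, so the score below is simply the fraction of questions on which the model picks the original.

```python
from typing import Callable, Sequence

LETTERS = "ABCDE"
NONE_OPTION = "None of the provided options"  # the extra fifth option


def build_quiz(original: str, perturbed: Sequence[str], original_pos: int) -> tuple[str, str]:
    """Return (prompt, correct_letter) with the original at index original_pos (0-3)."""
    if len(perturbed) != 3:
        raise ValueError("expected exactly three perturbed versions")
    if not 0 <= original_pos <= 3:
        raise ValueError("original_pos must be between 0 and 3")
    options = list(perturbed)
    options.insert(original_pos, original)
    options.append(NONE_OPTION)  # assumption: the 'none' option is always last
    lines = ["Which option is the exact instance from the dataset partition?"]
    lines += [f"{letter}) {text}" for letter, text in zip(LETTERS, options)]
    return "\n".join(lines), LETTERS[original_pos]


def estimate_contamination(
    instances: Sequence[tuple[str, Sequence[str]]],
    choose: Callable[[str], str],
) -> float:
    """Fraction of quiz questions on which `choose` selects the original instance.

    `choose` is a hypothetical stand-in for an LLM call: it receives the prompt
    and returns an option letter. Each instance is asked four times, once per
    position of the original, so a purely position-biased model scores 0.25.
    """
    correct = 0
    total = 0
    for original, perturbed in instances:
        for pos in range(4):
            prompt, answer = build_quiz(original, perturbed, pos)
            correct += choose(prompt).strip().upper().startswith(answer)
            total += 1
    return correct / total if total else 0.0
```

With this rotation, a model that always answers "A" lands at the chance level of 0.25, so scores well above that level are the ones that hint at exposure to the exact wording.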

Related papers

CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking [85.68235482145091]
Large-scale speech datasets have become valuable intellectual property. We propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW). We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks.
arXiv Detail & Related papers (2025-03-02T02:02:57Z)
PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series [0.01874930567916036]
Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools. Our dataset represents a discrete-sequence problem, which remains unaddressed by previously-proposed solutions in literature.
arXiv Detail & Related papers (2024-11-21T09:03:12Z)
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions [20.51842378080194]
Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. As LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination. We systematically review 50 papers on data contamination detection, categorize the underlying assumptions, and assess whether they have been rigorously validated.
arXiv Detail & Related papers (2024-10-24T17:58:22Z)
Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under black-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs). This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias". We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
Time Travel in LLMs: Tracing Data Contamination in Large Language Models [29.56037518816495]
We propose a straightforward yet effective method for identifying data contamination within large language models (LLMs). At its core, our approach starts by identifying potential contamination at the instance level. To estimate contamination of individual instances, we employ "guided instruction": a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance.
arXiv Detail & Related papers (2023-08-16T16:48:57Z)
Industrial Anomaly Detection with Domain Shift: A Real-world Dataset and Masked Multi-scale Reconstruction [2.921945366485149]
Industrial anomaly detection (IAD) is crucial for automating industrial quality inspection. Existing IAD datasets focus on the diversity of data categories. We propose the Aero-engine Blade Anomaly Detection (AeBAD) dataset, consisting of two sub-datasets.
arXiv Detail & Related papers (2023-04-05T04:07:54Z)
Unsupervised Model Selection for Time-series Anomaly Detection [7.8027110514393785]
We identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model.
arXiv Detail & Related papers (2022-10-03T16:49:30Z)
Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme, which selects optimal subsets of unlabeled samples with fixed batch size from the unlabeled data pool. Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z)
Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on. We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
Distributed Multivariate Regression Modeling For Selecting Biomarkers Under Data Protection Constraints [0.0]
We propose a multivariable regression approach for identifying biomarkers by automatic variable selection based on aggregated data in iterative calls. The approach can be used to jointly analyze data distributed across several locations. In a simulation, the information loss introduced by local standardization is seen to be minimal.
arXiv Detail & Related papers (2018-03-01T15:04:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.