How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
- URL: http://arxiv.org/abs/2502.00678v1
- Date: Sun, 02 Feb 2025 05:50:39 GMT
- Title: How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
- Authors: Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
- Abstract summary: Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations.
We propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between the kernel similarity matrices of sample embeddings before and after fine-tuning on the benchmark dataset.
KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines.
- Score: 23.019102917957152
- License:
- Abstract: Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Quantifying dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings, before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings.
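The abstract describes KDS only at a high level (embed the benchmark samples before and after fine-tuning, build kernel similarity matrices, and score their divergence) without fixing the kernel or the divergence measure. The sketch below is a minimal illustration under stated assumptions: an RBF kernel, row-normalized kernel rows compared with KL divergence, and embeddings already extracted as numpy arrays. None of these choices are claimed to be the paper's exact formulation.

```python
# Minimal, illustrative KDS-style computation (assumptions, not the paper's exact method):
# an RBF kernel over sample embeddings and a mean row-wise KL divergence between the
# row-normalized kernel matrices obtained before and after fine-tuning.
import numpy as np


def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1, keepdims=True) - 2.0 * X @ X.T + np.sum(X**2, axis=1)
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_divergence_score(
    emb_before: np.ndarray, emb_after: np.ndarray, gamma: float = 1.0, eps: float = 1e-12
) -> float:
    """Average row-wise KL divergence between kernel similarity matrices
    of the same benchmark samples, before vs. after fine-tuning."""
    P = rbf_kernel(emb_before, gamma)
    Q = rbf_kernel(emb_after, gamma)
    P /= P.sum(axis=1, keepdims=True) + eps  # each row becomes a distribution
    Q /= Q.sum(axis=1, keepdims=True) + eps
    kl_rows = np.sum(P * np.log((P + eps) / (Q + eps)), axis=1)
    return float(kl_rows.mean())


# Usage with random stand-ins for the real embeddings (one row per benchmark sample).
rng = np.random.default_rng(0)
emb_before = rng.normal(size=(128, 64))
emb_after = emb_before + 0.1 * rng.normal(size=(128, 64))  # stand-in for fine-tuning shift
print(f"illustrative KDS: {kernel_divergence_score(emb_before, emb_after):.4f}")
```

Under the abstract's insight that fine-tuning moves unseen samples more than memorized ones, a smaller kernel shift for a benchmark would point toward heavier contamination; how the paper maps the divergence onto a contamination level is not specified here.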
Related papers
- A Survey on Data Contamination for Large Language Models [12.431575579432458]
Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis.
The reliability of performance evaluation has come under scrutiny due to data contamination.
arXiv Detail & Related papers (2025-02-20T10:23:27Z)
- AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models [2.463617251923349]
AdEval is a dynamic data evaluation method aimed at mitigating the impact of data contamination on evaluation reliability.
It extracts key knowledge points and main ideas to align dynamically generated questions with static data's core concepts.
It also leverages online search to provide detailed explanations of related knowledge points, thereby creating high-quality evaluation samples.
arXiv Detail & Related papers (2025-01-23T06:57:24Z)
- Testing and Improving the Robustness of Amortized Bayesian Inference for Cognitive Models [0.5223954072121659]
Contaminant observations and outliers often cause problems when estimating the parameters of cognitive models.
In this study, we test and improve the robustness of parameter estimation using amortized Bayesian inference.
The proposed method is straightforward and practical to implement and has broad applicability in fields where outlier detection or removal is challenging.
arXiv Detail & Related papers (2024-12-29T21:22:24Z)
- Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds [48.37843602248313]
Deep neural networks (DNNs) are vulnerable to adversarial samples crafted by adding imperceptible perturbations to clean data, potentially leading to incorrect and dangerous predictions.
We propose Consistency Model-based Adversarial Purification (CMAP), which optimizes vectors within the latent space of a pre-trained consistency model to generate samples that restore clean data.
CMAP significantly enhances robustness against strong adversarial attacks while preserving high natural accuracy.
arXiv Detail & Related papers (2024-12-11T14:14:02Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios; a minimal sketch of the comparison idea is given below.
arXiv Detail & Related papers (2024-05-25T15:36:37Z)
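Based only on the one-sentence description above, the following is a hedged sketch of a performance-based comparison in that spirit: fit a simple relationship between reference-benchmark and primary-benchmark scores across a set of reference models, then read any excess of a model's observed primary-benchmark score over the prediction as a contamination signal. The linear fit, helper names, and example numbers are illustrative assumptions, not ConStat's actual statistical procedure.

```python
# Hedged sketch of a performance-based contamination check in the spirit of ConStat.
# Assumption: the expected primary-benchmark score is predicted from the
# reference-benchmark score via a linear fit over reference models; the gap
# between observed and predicted scores is treated as the contamination signal.
import numpy as np


def contamination_gap(
    model_primary: float,
    model_reference: float,
    ref_primary: np.ndarray,
    ref_reference: np.ndarray,
) -> float:
    """Observed primary-benchmark score minus the score predicted from the
    model's reference-benchmark score (fit across reference models)."""
    slope, intercept = np.polyfit(ref_reference, ref_primary, 1)
    expected_primary = slope * model_reference + intercept
    return model_primary - expected_primary


# Hypothetical accuracies for four presumably uncontaminated reference models.
ref_primary = np.array([0.55, 0.62, 0.70, 0.78])
ref_reference = np.array([0.50, 0.58, 0.66, 0.74])

# A suspect model: strong on the primary benchmark, average on the reference one.
gap = contamination_gap(0.85, 0.60, ref_primary, ref_reference)
print(f"estimated contamination gap: {gap:+.3f}")  # a large positive gap is suspicious
```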
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models [42.958880063727996]
CDD stands for Contamination Detection via output Distribution for LLMs.
To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution.
arXiv Detail & Related papers (2024-02-24T23:54:41Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- The Risk of Federated Learning to Skew Fine-Tuning Features and Underperform Out-of-Distribution Robustness [50.52507648690234]
Federated learning risks skewing fine-tuning features and compromising the out-of-distribution robustness of the model.
We introduce three robustness indicators and conduct experiments across diverse robust datasets.
Our approach markedly enhances the robustness across diverse scenarios, encompassing various parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-01-25T09:18:51Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Decomposed Adversarial Learned Inference [118.27187231452852]
We propose a novel approach, Decomposed Adversarial Learned Inference (DALI).
DALI explicitly matches prior and conditional distributions in both data and code spaces.
We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets.
arXiv Detail & Related papers (2020-04-21T20:00:35Z)