Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation
- URL: http://arxiv.org/abs/2309.10677v2
- Date: Wed, 27 Sep 2023 01:15:49 GMT
- Title: Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation
- Authors: Yucheng Li
- Abstract summary: We propose a novel method to quantify contamination without access to the full training set.
Our analysis provides evidence of significant memorisation by recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated.
- Score: 2.4173424114751114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data contamination in model evaluation is becoming increasingly
prevalent as the massive training corpora of large language models often
unintentionally include benchmark samples. Contamination analysis has
therefore become an indispensable part of reliable model evaluation.
However, existing methods of contamination analysis require access to the
entire training data, which is often confidential for recent models. This
prevents the community from rigorously auditing these models and accurately
assessing their capabilities. In this paper, we propose a novel method to
quantify contamination without access to the full training set, measuring
the extent of contamination via perplexity. Our analysis provides evidence
of significant memorisation by recent foundation models on popular reading
comprehension and summarisation benchmarks, while multiple-choice
benchmarks appear less contaminated.
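A minimal sketch of the core idea, assuming a HuggingFace causal language model: score benchmark samples and comparable fresh text with per-token perplexity, and treat markedly lower perplexity on the benchmark as a hint of memorisation. The model name, the placeholder sample lists, and the simple mean comparison are illustrative assumptions, not the authors' exact protocol.

# Illustrative sketch only: compare per-token perplexity on benchmark
# samples against comparable post-cutoff text. "gpt2" and the sample
# lists are placeholders for the model and data under audit.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute the model being audited
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    # Per-token perplexity = exp of the average negative log-likelihood.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

benchmark_samples = ["<benchmark passage 1>", "<benchmark passage 2>"]
baseline_samples = ["<fresh passage 1>", "<fresh passage 2>"]  # text the model cannot have seen

bench_ppl = sum(perplexity(t) for t in benchmark_samples) / len(benchmark_samples)
base_ppl = sum(perplexity(t) for t in baseline_samples) / len(baseline_samples)
print(f"benchmark ppl: {bench_ppl:.2f}  baseline ppl: {base_ppl:.2f}")
# A benchmark perplexity far below the baseline suggests the benchmark
# text was memorised during training, i.e. likely contamination.

The paper's actual estimator is more careful than a single mean comparison; the sketch only illustrates where perplexity enters the picture.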
Related papers
- Training on the Test Model: Contamination in Ranking Distillation [14.753216172912968]
We investigate the effect of a contaminated teacher model in a distillation setting.
We find that contamination occurs even when the test data represents a small fraction of the teacher's training samples.
arXiv Detail & Related papers (2024-11-04T17:11:14Z)
- Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks.
Detecting poisoned samples in a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
- ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios.
arXiv Detail & Related papers (2024-05-25T15:36:37Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Evading Data Contamination Detection for Language Models is (too) Easy [9.024665800235855]
The massive training data of large language models can inadvertently include public benchmarks, leading to contamination.
We propose a categorization of both model providers and contamination detection methods.
This reveals vulnerabilities in existing methods that we exploit with EAL.
arXiv Detail & Related papers (2024-02-05T09:10:32Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- An Open Source Data Contamination Report for Large Language Models [21.553915781660905]
This paper presents an extensive data contamination report for over 15 popular large language models.
We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models.
arXiv Detail & Related papers (2023-10-26T17:11:42Z)
- Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization.
We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
- The Implicit Delta Method [61.36121543728134]
In this paper, we propose an alternative, the implicit delta method, which works by infinitesimally regularizing the training loss of uncertainty.
We show that the change in the evaluation due to regularization is consistent for the variance of the evaluation estimator, even when the infinitesimal change is approximated by a finite difference.
arXiv Detail & Related papers (2022-11-11T19:34:17Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.