An Open Source Data Contamination Report for Large Language Models
- URL: http://arxiv.org/abs/2310.17589v3
- Date: Mon, 29 Jan 2024 02:11:01 GMT
- Title: An Open Source Data Contamination Report for Large Language Models
- Authors: Yucheng Li, Frank Guerin, Chenghua Lin
- Abstract summary: This paper presents an extensive data contamination report for over 15 popular large language models.
We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models.
- Score: 21.553915781660905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data contamination in model evaluation has become increasingly prevalent with
the growing popularity of large language models. It allows models to "cheat"
via memorisation instead of displaying true capabilities. Therefore,
contamination analysis has become a crucial part of reliable model evaluation
to validate results. However, existing contamination analysis is usually
conducted internally by large language model developers and often lacks
transparency and completeness. This paper presents an extensive data
contamination report for over 15 popular large language models across six
popular multiple-choice QA benchmarks. We also introduce an open-source
pipeline that enables the community to perform contamination analysis on
customised data and models. Our experiments reveal varying contamination levels
ranging from 1% to 45% across benchmarks, with the contamination degree
increasing rapidly over time. Performance analysis of large language models
indicates that data contamination does not necessarily lead to increased model
metrics: while significant accuracy boosts of up to 14% and 7% are observed
on contaminated C-Eval and Hellaswag benchmarks, only a minimal increase is
noted on contaminated MMLU. We also find that larger models seem to gain a
greater advantage than smaller models on contaminated test sets.
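The paper's open-source pipeline is not reproduced here; as a general illustration of how contamination analysis of this kind can work, a minimal n-gram-overlap check between a benchmark item and a training document might look like the sketch below (function names and the threshold are illustrative assumptions, not the paper's actual pipeline):

```python
# Hypothetical sketch of n-gram-overlap contamination screening.
# Names and the 0.5 threshold are illustrative, not taken from the paper.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams also found in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

def is_contaminated(benchmark_item: str, training_doc: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag an item when enough of its n-grams appear verbatim in training data."""
    return overlap_ratio(benchmark_item, training_doc, n) >= threshold
```

In practice such checks run against indexed training corpora rather than single documents, and the n-gram size and threshold trade recall against false positives.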
Related papers
- PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to artificially inflated scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, a Paired Confidence Significance Testing to effectively detect benchmark contamination in LLMs.
arXiv Detail & Related papers (2024-06-26T13:12:40Z) - ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios.
arXiv Detail & Related papers (2024-05-25T15:36:37Z) - Evading Data Contamination Detection for Language Models is (too) Easy [9.024665800235855]
Training data for large language models can inadvertently become contaminated with public benchmarks.
We propose a categorization of both model providers and contamination detection methods.
This reveals vulnerabilities in existing methods that we exploit with EAL.
arXiv Detail & Related papers (2024-02-05T09:10:32Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restricted to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation [2.4173424114751114]
We propose a novel method to quantify contamination without access to the full training set.
Our analysis provides evidence of significant memorisation by recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated.
arXiv Detail & Related papers (2023-09-19T15:02:58Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z) - CHEER: Rich Model Helps Poor Model via Knowledge Infusion [69.23072792708263]
We develop a knowledge infusion framework named CHEER that can succinctly summarize such rich model into transferable representations.
Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
arXiv Detail & Related papers (2020-05-21T21:44:21Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
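One of the detection ideas listed above, perplexity-based contamination estimation, reduces to asking how "surprised" a model is by benchmark text compared with unseen text of the same kind. A minimal sketch of the arithmetic (the token log-probabilities and the 0.5 ratio are stand-ins for real model outputs and a tuned threshold, not values from the paper) might be:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(-mean log-probability) over a token sequence.
    Memorised (contaminated) text tends to yield lower perplexity."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def looks_memorised(benchmark_ppl: float, baseline_ppl: float,
                    ratio: float = 0.5) -> bool:
    """Illustrative heuristic: flag possible contamination when benchmark
    perplexity falls far below that of comparable unseen text."""
    return benchmark_ppl < ratio * baseline_ppl
```

A real pipeline would obtain the log-probabilities from the model under test and compare distributions over many samples rather than a single ratio.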
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.