An Open Source Data Contamination Report for Large Language Models
- URL: http://arxiv.org/abs/2310.17589v3
- Date: Mon, 29 Jan 2024 02:11:01 GMT
- Title: An Open Source Data Contamination Report for Large Language Models
- Authors: Yucheng Li, Frank Guerin, Chenghua Lin
- Abstract summary: This paper presents an extensive data contamination report for over 15 popular large language models.
We also introduce an open-source pipeline that enables the community to perform contamination analysis on customised data and models.
- Score: 21.553915781660905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data contamination in model evaluation has become increasingly prevalent with
the growing popularity of large language models. It allows models to "cheat"
via memorisation instead of displaying true capabilities. Therefore,
contamination analysis has become a crucial part of reliable model evaluation
to validate results. However, existing contamination analysis is usually
conducted internally by large language model developers and often lacks
transparency and completeness. This paper presents an extensive data
contamination report for over 15 popular large language models across six
popular multiple-choice QA benchmarks. We also introduce an open-source
pipeline that enables the community to perform contamination analysis on
customised data and models. Our experiments reveal contamination levels
ranging from 1% to 45% across benchmarks, with the degree of contamination
increasing rapidly over time. Performance analysis of large language models
indicates that data contamination does not necessarily lead to higher model
metrics: while significant accuracy boosts of up to 14% and 7% are observed
on the contaminated C-Eval and Hellaswag benchmarks, only a minimal increase is
noted on contaminated MMLU. We also find that larger models appear to gain a
greater advantage than smaller models on contaminated test sets.
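To make the notion of a contamination check concrete, below is a minimal sketch of one common approach: flagging benchmark items whose word n-grams overlap heavily with a training corpus. The function names, the 13-gram size, and the 0.8 overlap threshold are illustrative assumptions for this sketch and do not reproduce the authors' actual open-source pipeline.

```python
# Minimal sketch of an n-gram overlap contamination check.
# Assumptions: benchmark items and training documents are plain strings;
# the n-gram size and overlap threshold below are illustrative defaults,
# not the settings used by the paper's pipeline.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(sample: str, corpus_ngrams: set, n: int = 13) -> float:
    """Fraction of a benchmark sample's n-grams that also occur in the corpus."""
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        return 0.0
    return len(sample_ngrams & corpus_ngrams) / len(sample_ngrams)

def contamination_rate(benchmark: list[str], corpus: list[str],
                       n: int = 13, threshold: float = 0.8) -> float:
    """Share of benchmark items flagged as contaminated."""
    corpus_ngrams = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(overlap_ratio(s, corpus_ngrams, n) >= threshold for s in benchmark)
    return flagged / len(benchmark)

if __name__ == "__main__":
    benchmark = ["Which planet is known as the Red Planet? (A) Mars (B) Venus (C) Jupiter (D) Saturn"]
    corpus = ["... Which planet is known as the Red Planet? (A) Mars (B) Venus (C) Jupiter (D) Saturn ..."]
    print(f"Contaminated fraction: {contamination_rate(benchmark, corpus, n=5):.2%}")
```

Exact-match and n-gram overlap are only one family of checks; the related papers below cover perplexity-based and statistical alternatives.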
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to artificially inflated scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, Paired Confidence Significance Testing, to effectively detect benchmark contamination in LLMs.
arXiv Detail & Related papers (2024-06-26T13:12:40Z) - ConStat: Performance-Based Contamination Detection in Large Language Models [7.305342793164905]
ConStat is a statistical method that reliably detects and quantifies contamination by comparing performance between a primary and reference benchmark relative to a set of reference models.
We demonstrate the effectiveness of ConStat in an extensive evaluation of diverse model architectures, benchmarks, and contamination scenarios.
arXiv Detail & Related papers (2024-05-25T15:36:37Z) - How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly used in business applications and in AI fundraising.
Their reported performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z) - Evading Data Contamination Detection for Language Models is (too) Easy [9.024665800235855]
The vast amounts of data on which large language models are trained can inadvertently lead to contamination with public benchmarks.
We propose a categorization of both model providers and contamination detection methods.
This reveals vulnerabilities in existing methods that we exploit with EAL.
arXiv Detail & Related papers (2024-02-05T09:10:32Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restricted to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z) - Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation [2.4173424114751114]
We propose a novel method to quantify contamination without access to the full training set.
Our analysis provides evidence of significant memorisation by recent foundation models on popular reading comprehension and summarisation benchmarks, while multiple-choice benchmarks appear less contaminated; a rough sketch of the perplexity idea is given after this list.
arXiv Detail & Related papers (2023-09-19T15:02:58Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - CHEER: Rich Model Helps Poor Model via Knowledge Infusion [69.23072792708263]
We develop a knowledge infusion framework named CHEER that can succinctly summarize such a rich model into transferable representations.
Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
arXiv Detail & Related papers (2020-05-21T21:44:21Z)
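As a rough illustration of the perplexity-based idea referenced above (Estimating Contamination via Perplexity), the sketch below scores benchmark samples by their perplexity under a causal language model; unusually low perplexity relative to comparable unseen text can hint at memorisation. The "gpt2" checkpoint, the sample texts, and the use of Hugging Face transformers are assumptions for this sketch, not that paper's actual protocol.

```python
# Rough sketch: score benchmark samples by perplexity under a causal LM.
# Assumptions: the "gpt2" checkpoint and the sample texts are placeholders;
# the referenced paper's exact protocol may differ.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a single text under the model (lower = more 'familiar')."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

# Compare a benchmark passage against freshly written text of similar style;
# a large gap in perplexity is suggestive (not proof) of memorisation.
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "A newly written sentence the model is unlikely to have memorised.",
]
for text in samples:
    print(f"{perplexity(text):8.2f}  {text}")
```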