Data Contamination Through the Lens of Time
- URL: http://arxiv.org/abs/2310.10628v1
- Date: Mon, 16 Oct 2023 17:51:29 GMT
- Title: Data Contamination Through the Lens of Time
- Authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White,
Samuel Dooley
- Abstract summary: Claims about the abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
- Score: 21.933771085956426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent claims about the impressive abilities of large language models (LLMs)
are often supported by evaluating publicly available benchmarks. Since LLMs
train on wide swaths of the internet, this practice raises concerns of data
contamination, i.e., evaluating on examples that are explicitly or implicitly
included in the training data. Data contamination remains notoriously
challenging to measure and mitigate, even with partial attempts like controlled
experimentation of training data, canary strings, or embedding similarities. In
this work, we conduct the first thorough longitudinal analysis of data
contamination in LLMs by using the natural experiment of training cutoffs in
GPT models to look at benchmarks released over time. Specifically, we consider
two code/mathematical problem-solving datasets, Codeforces and Project Euler,
and find statistically significant trends among LLM pass rate vs. GitHub
popularity and release date that provide strong evidence of contamination. By
open-sourcing our dataset, raw results, and evaluation framework, our work
paves the way for rigorous analyses of data contamination in modern models. We
conclude with a discussion of best practices and future steps for publicly
releasing benchmarks in the age of LLMs that train on webscale data.
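The longitudinal idea in the abstract can be illustrated with a minimal sketch: split benchmark problems by release date relative to a model's training cutoff and compare pass rates on each side. This is not the authors' evaluation framework; the `Problem` record, its field names, and the two-proportion z-statistic are assumptions chosen for illustration, and the GitHub-popularity dimension of the analysis is not modeled here.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Problem:
    # Hypothetical record; field names are illustrative, not from the paper.
    released: date  # public release date of the benchmark problem
    passed: bool    # whether the model's solution passed


def split_pass_rates(problems, cutoff):
    """Compare pass rates for problems released before vs. on/after `cutoff`.

    Returns (pre_rate, post_rate, z), where z is a pooled two-proportion
    z-statistic. A large positive z means the model does better on problems
    it could have seen during training.
    """
    pre = [p.passed for p in problems if p.released < cutoff]
    post = [p.passed for p in problems if p.released >= cutoff]
    if not pre or not post:
        raise ValueError("need at least one problem on each side of the cutoff")

    n_pre, n_post = len(pre), len(post)
    pre_rate = sum(pre) / n_pre
    post_rate = sum(post) / n_post

    # Pooled standard error under the null hypothesis of equal pass rates.
    pooled = (sum(pre) + sum(post)) / (n_pre + n_post)
    se = sqrt(pooled * (1 - pooled) * (1 / n_pre + 1 / n_post))
    z = (pre_rate - post_rate) / se if se > 0 else 0.0
    return pre_rate, post_rate, z
```

A drop in pass rate after the cutoff is only suggestive on its own, since newer problems may also differ in difficulty; this is one reason to check a second signal such as popularity, as the abstract describes.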
Related papers
- A Survey on Data Contamination for Large Language Models [12.431575579432458]
Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis.
The reliability of performance evaluation has come under scrutiny due to data contamination.
arXiv Detail & Related papers (2025-02-20T10:23:27Z)
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge [68.39683427262335]
Existing studies fail to guarantee contamination-free evaluation as newly collected data may contain pre-existing knowledge.
We propose AntiLeak-Bench, an automated anti-leakage benchmarking framework.
arXiv Detail & Related papers (2024-12-18T09:53:12Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs [5.310555620116225]
We conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4.
We document the amount of data leaked to these models during the first year after the model's release.
We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks.
arXiv Detail & Related papers (2024-02-06T11:54:23Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.