Data Contamination Through the Lens of Time
- URL: http://arxiv.org/abs/2310.10628v1
- Date: Mon, 16 Oct 2023 17:51:29 GMT
- Title: Data Contamination Through the Lens of Time
- Authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White,
Samuel Dooley
- Abstract summary: Claims about the abilities of large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
- Score: 21.933771085956426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent claims about the impressive abilities of large language models (LLMs)
are often supported by evaluating publicly available benchmarks. Since LLMs
train on wide swaths of the internet, this practice raises concerns of data
contamination, i.e., evaluating on examples that are explicitly or implicitly
included in the training data. Data contamination remains notoriously
challenging to measure and mitigate, even with partial attempts like controlled
experimentation of training data, canary strings, or embedding similarities. In
this work, we conduct the first thorough longitudinal analysis of data
contamination in LLMs by using the natural experiment of training cutoffs in
GPT models to look at benchmarks released over time. Specifically, we consider
two code/mathematical problem-solving datasets, Codeforces and Project Euler,
and find statistically significant trends among LLM pass rate vs. GitHub
popularity and release date that provide strong evidence of contamination. By
open-sourcing our dataset, raw results, and evaluation framework, our work
paves the way for rigorous analyses of data contamination in modern models. We
conclude with a discussion of best practices and future steps for publicly
releasing benchmarks in the age of LLMs that train on webscale data.
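The abstract describes a longitudinal analysis: per-problem pass rates are compared against problem release date and GitHub presence on either side of a model's training cutoff. The released evaluation framework is not reproduced here; the sketch below only illustrates that kind of trend test, and the field names (`release_date`, `pass_rate`, `github_presence`) and cutoff date are assumptions.

```python
# A minimal sketch (not the paper's released framework) of the longitudinal
# contamination test described above: if pass rate correlates with GitHub
# popularity only for problems released BEFORE the model's training cutoff,
# that points to contamination rather than genuine problem-solving ability.
# Field names and the cutoff value are assumptions for illustration.
from datetime import date
from scipy.stats import spearmanr

GPT4_CUTOFF = date(2021, 9, 30)  # approximate publicly stated knowledge cutoff

def contamination_trend(problems: list[dict]) -> dict:
    """Correlate pass rate with GitHub popularity, split at the training cutoff."""
    results = {}
    splits = {
        "pre_cutoff": [p for p in problems if p["release_date"] <= GPT4_CUTOFF],
        "post_cutoff": [p for p in problems if p["release_date"] > GPT4_CUTOFF],
    }
    for era, subset in splits.items():
        if len(subset) < 3:
            results[era] = None
            continue
        rho, pval = spearmanr(
            [p["github_presence"] for p in subset],
            [p["pass_rate"] for p in subset],
        )
        results[era] = {"spearman_rho": rho, "p_value": pval, "n": len(subset)}
    return results
```

A positive, significant pre-cutoff correlation that disappears post-cutoff would mirror the kind of trend the abstract reports for Codeforces and Project Euler.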
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
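The summary above does not spell out how the option contents are used. Purely as an illustration of a black-box check in the same spirit (not the paper's algorithm), the sketch below asks a model to continue a multiple-choice item and measures how close the completion is to a held-out option; `query_model` is a placeholder for any text-completion API.

```python
# Illustrative only: a generic black-box leakage probe built on the contents
# of multiple-choice options, in the spirit of the method summarized above
# but not its exact algorithm. No weights or training data are required;
# `query_model` is a hypothetical text-completion call.
from difflib import SequenceMatcher

def option_completion_score(query_model, question: str, options: list[str]) -> float:
    """Show all but the last option and ask the model to continue the item.

    A near-verbatim reproduction of the held-out option suggests the benchmark
    text itself may have been memorized during training.
    """
    shown, held_out = options[:-1], options[-1]
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(shown)
    ) + f"\n{chr(ord('A') + len(shown))}."
    completion = query_model(prompt)
    return SequenceMatcher(None, held_out.strip(), completion.strip()).ratio()
```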
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
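The entry above ties model performance to the compression ratio of the training data. As a rough, hedged illustration only, that quantity can be proxied with gzip; the paper's own estimator and selection algorithm may differ.

```python
# A rough proxy for the "compression ratio" mentioned above, using gzip.
# This is an illustration, not the paper's estimator or selection method;
# the ranking step below is a hypothetical way to feed the ratio into
# data selection consistent with the reported correlation.
import gzip

def compression_ratio(texts: list[str]) -> float:
    """Compressed size divided by raw size for a candidate training corpus."""
    raw = "\n".join(texts).encode("utf-8")
    return len(gzip.compress(raw)) / max(len(raw), 1)

# Example: rank candidate shards by compression ratio before sampling.
shards = {
    "shard_a": ["def add(a, b): return a + b"] * 100,
    "shard_b": ["Mix of varied sentences.", "Quite different content here."],
}
ranked = sorted(shards, key=lambda name: compression_ratio(shards[name]))
```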
- Probing Language Models for Pre-training Data Detection [11.37731401086372]
We propose to utilize the probing technique for pre-training data detection by examining the model's internal activations.
Our method is simple and effective, leading to more trustworthy pre-training data detection.
arXiv Detail & Related papers (2024-06-03T13:58:04Z)
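The summary above mentions probing internal activations. A minimal sketch of what such a probe could look like is given below, assuming an open-weight model accessed through Hugging Face transformers; it is not the authors' exact probe, and the layer choice and pooling are assumptions.

```python
# A minimal sketch of probing internal activations for pre-training data
# detection: extract a hidden-state feature per text and fit a linear probe
# on known member/non-member examples. Not the authors' exact method.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def activation_feature(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state of one intermediate layer for a single text."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def fit_probe(member_texts: list[str], nonmember_texts: list[str]) -> LogisticRegression:
    X = torch.stack(
        [activation_feature(t) for t in member_texts + nonmember_texts]
    ).numpy()
    y = [1] * len(member_texts) + [0] * len(nonmember_texts)
    return LogisticRegression(max_iter=1000).fit(X, y)
```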
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training set and even test set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
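The pipeline above is built on perplexity and n-gram accuracy. A hedged sketch of how both metrics could be computed for a benchmark example with an open-weight causal LM is shown below; the paper's exact prompting, n-gram size, and decision thresholds may differ.

```python
# A hedged sketch of the two metrics named above, computed with an open-weight
# causal LM via Hugging Face transformers. Prompt format, n-gram size, and
# thresholds are assumptions, not the paper's exact settings.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Token-level perplexity of one benchmark example under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

def ngram_accuracy(text: str, n: int = 5) -> float:
    """Fraction of greedily predicted n-token continuations that match the text."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(1, len(ids) - n, n):
        prefix = ids[:start].unsqueeze(0)
        with torch.no_grad():
            out = lm.generate(prefix, max_new_tokens=n, do_sample=False)
        hits += int(torch.equal(out[0, start:start + n], ids[start:start + n]))
        total += 1
    return hits / max(total, 1)
```

Unusually low perplexity or unusually high n-gram accuracy on a benchmark, relative to comparable held-out text, is the kind of signal such a pipeline would flag.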
- How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and in AI fundraising.
Reported LLM performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
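Ask-LLM, as summarized above, uses a language model's own judgment to score candidate training examples. The sketch below illustrates that idea with a placeholder scoring call; the paper's actual prompt, scoring, and the Density sampler are not reproduced here.

```python
# A hedged sketch of the Ask-LLM idea summarized above: score each candidate
# training example by asking an instruction-tuned model whether it is useful,
# then keep the top-scoring fraction. `ask_model` is a hypothetical
# chat/completion call; the paper's prompt and scoring differ in detail.
def ask_llm_filter(examples: list[str], ask_model, keep_fraction: float = 0.5) -> list[str]:
    """Keep the examples the scoring model rates most useful for pre-training."""
    prompt_tmpl = (
        "Here is a candidate pre-training example:\n###\n{doc}\n###\n"
        "Would training on this example improve a language model? Answer yes or no."
    )

    def score(doc: str) -> float:
        reply = ask_model(prompt_tmpl.format(doc=doc)).strip().lower()
        return 1.0 if reply.startswith("yes") else 0.0

    ranked = sorted(examples, key=score, reverse=True)
    return ranked[: max(1, int(len(examples) * keep_fraction))]
```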
- Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs [5.310555620116225]
We conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4.
We document the amount of data leaked to these models during the first year after each model's release.
We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks.
arXiv Detail & Related papers (2024-02-06T11:54:23Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)