Data Contamination Through the Lens of Time
- URL: http://arxiv.org/abs/2310.10628v1
- Date: Mon, 16 Oct 2023 17:51:29 GMT
- Title: Data Contamination Through the Lens of Time
- Authors: Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White,
Samuel Dooley
- Abstract summary: Claims about the abilities of large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
- Score: 21.933771085956426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent claims about the impressive abilities of large language models (LLMs)
are often supported by evaluating publicly available benchmarks. Since LLMs
train on wide swaths of the internet, this practice raises concerns of data
contamination, i.e., evaluating on examples that are explicitly or implicitly
included in the training data. Data contamination remains notoriously
challenging to measure and mitigate, even with partial attempts like controlled
experimentation of training data, canary strings, or embedding similarities. In
this work, we conduct the first thorough longitudinal analysis of data
contamination in LLMs by using the natural experiment of training cutoffs in
GPT models to look at benchmarks released over time. Specifically, we consider
two code/mathematical problem-solving datasets, Codeforces and Project Euler,
and find statistically significant trends among LLM pass rate vs. GitHub
popularity and release date that provide strong evidence of contamination. By
open-sourcing our dataset, raw results, and evaluation framework, our work
paves the way for rigorous analyses of data contamination in modern models. We
conclude with a discussion of best practices and future steps for publicly
releasing benchmarks in the age of LLMs that train on webscale data.
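The abstract describes a longitudinal analysis: per-problem pass rates are compared against problem release date and GitHub presence on either side of a model's training cutoff. The released evaluation framework is not reproduced here; the sketch below only illustrates that kind of trend test, and the field names (`release_date`, `pass_rate`, `github_presence`) and cutoff date are assumptions.

```python
# A minimal sketch (not the paper's released framework) of the longitudinal
# contamination test described above: if pass rate correlates with GitHub
# popularity only for problems released BEFORE the model's training cutoff,
# that points to contamination rather than genuine problem-solving ability.
# Field names and the cutoff value are assumptions for illustration.
from datetime import date
from scipy.stats import spearmanr

GPT4_CUTOFF = date(2021, 9, 30)  # approximate publicly stated knowledge cutoff

def contamination_trend(problems: list[dict]) -> dict:
    """Correlate pass rate with GitHub popularity, split at the training cutoff."""
    results = {}
    splits = {
        "pre_cutoff": [p for p in problems if p["release_date"] <= GPT4_CUTOFF],
        "post_cutoff": [p for p in problems if p["release_date"] > GPT4_CUTOFF],
    }
    for era, subset in splits.items():
        if len(subset) < 3:
            results[era] = None
            continue
        rho, pval = spearmanr(
            [p["github_presence"] for p in subset],
            [p["pass_rate"] for p in subset],
        )
        results[era] = {"spearman_rho": rho, "p_value": pval, "n": len(subset)}
    return results
```

A positive, significant pre-cutoff correlation that disappears post-cutoff would mirror the kind of trend the abstract reports for Codeforces and Project Euler.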
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
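The summary above does not spell out how the option contents are used. Purely as an illustration of a black-box check in the same spirit (not the paper's algorithm), the sketch below asks a model to continue a multiple-choice item and measures how close the completion is to a held-out option; `query_model` is a placeholder for any text-completion API.

```python
# Illustrative only: a generic black-box leakage probe built on the contents
# of multiple-choice options, in the spirit of the method summarized above
# but not its exact algorithm. No weights or training data are required;
# `query_model` is a hypothetical text-completion call.
from difflib import SequenceMatcher

def option_completion_score(query_model, question: str, options: list[str]) -> float:
    """Show all but the last option and ask the model to continue the item.

    A near-verbatim reproduction of the held-out option suggests the benchmark
    text itself may have been memorized during training.
    """
    shown, held_out = options[:-1], options[-1]
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(shown)
    ) + f"\n{chr(ord('A') + len(shown))}."
    completion = query_model(prompt)
    return SequenceMatcher(None, held_out.strip(), completion.strip()).ratio()
```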
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated with the compression ratio of the training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
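The entry above ties model performance to the compression ratio of the training data. As a rough, hedged illustration only, that quantity can be proxied with gzip; the paper's own estimator and selection algorithm may differ.

```python
# A rough proxy for the "compression ratio" mentioned above, using gzip.
# This is an illustration, not the paper's estimator or selection method;
# the ranking step below is a hypothetical way to feed the ratio into
# data selection consistent with the reported correlation.
import gzip

def compression_ratio(texts: list[str]) -> float:
    """Compressed size divided by raw size for a candidate training corpus."""
    raw = "\n".join(texts).encode("utf-8")
    return len(gzip.compress(raw)) / max(len(raw), 1)

# Example: rank candidate shards by compression ratio before sampling.
shards = {
    "shard_a": ["def add(a, b): return a + b"] * 100,
    "shard_b": ["Mix of varied sentences.", "Quite different content here."],
}
ranked = sorted(shards, key=lambda name: compression_ratio(shards[name]))
```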
- Probing Language Models for Pre-training Data Detection [11.37731401086372]
We propose to utilize the probing technique for pre-training data detection by examining the model's internal activations.
Our method is simple and effective, leading to more trustworthy pre-training data detection.
arXiv Detail & Related papers (2024-06-03T13:58:04Z)
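The summary above mentions probing internal activations. A minimal sketch of what such a probe could look like is given below, assuming an open-weight model accessed through Hugging Face transformers; it is not the authors' exact probe, and the layer choice and pooling are assumptions.

```python
# A minimal sketch of probing internal activations for pre-training data
# detection: extract a hidden-state feature per text and fit a linear probe
# on known member/non-member examples. Not the authors' exact method.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def activation_feature(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state of one intermediate layer for a single text."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def fit_probe(member_texts: list[str], nonmember_texts: list[str]) -> LogisticRegression:
    X = torch.stack(
        [activation_feature(t) for t in member_texts + nonmember_texts]
    ).numpy()
    y = [1] * len(member_texts) + [0] * len(nonmember_texts)
    return LogisticRegression(max_iter=1000).fit(X, y)
```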
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training set and even test set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
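The pipeline above is built on perplexity and n-gram accuracy. A hedged sketch of how both metrics could be computed for a benchmark example with an open-weight causal LM is shown below; the paper's exact prompting, n-gram size, and decision thresholds may differ.

```python
# A hedged sketch of the two metrics named above, computed with an open-weight
# causal LM via Hugging Face transformers. Prompt format, n-gram size, and
# thresholds are assumptions, not the paper's exact settings.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    """Token-level perplexity of one benchmark example under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

def ngram_accuracy(text: str, n: int = 5) -> float:
    """Fraction of greedily predicted n-token continuations that match the text."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for start in range(1, len(ids) - n, n):
        prefix = ids[:start].unsqueeze(0)
        with torch.no_grad():
            out = lm.generate(prefix, max_new_tokens=n, do_sample=False)
        hits += int(torch.equal(out[0, start:start + n], ids[start:start + n]))
        total += 1
    return hits / max(total, 1)
```

Unusually low perplexity or unusually high n-gram accuracy on a benchmark, relative to comparable held-out text, is the kind of signal such a pipeline would flag.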
- How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and in AI fundraising.
Reported LLM performance may no longer be reliable, as high scores may be at least partly due to prior exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
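Ask-LLM, as summarized above, uses a language model's own judgment to score candidate training examples. The sketch below illustrates that idea with a placeholder scoring call; the paper's actual prompt, scoring, and the Density sampler are not reproduced here.

```python
# A hedged sketch of the Ask-LLM idea summarized above: score each candidate
# training example by asking an instruction-tuned model whether it is useful,
# then keep the top-scoring fraction. `ask_model` is a hypothetical
# chat/completion call; the paper's prompt and scoring differ in detail.
def ask_llm_filter(examples: list[str], ask_model, keep_fraction: float = 0.5) -> list[str]:
    """Keep the examples the scoring model rates most useful for pre-training."""
    prompt_tmpl = (
        "Here is a candidate pre-training example:\n###\n{doc}\n###\n"
        "Would training on this example improve a language model? Answer yes or no."
    )

    def score(doc: str) -> float:
        reply = ask_model(prompt_tmpl.format(doc=doc)).strip().lower()
        return 1.0 if reply.startswith("yes") else 0.0

    ranked = sorted(examples, key=score, reverse=True)
    return ranked[: max(1, int(len(examples) * keep_fraction))]
```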
- Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs [5.310555620116225]
We conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4.
We document the amount of data leaked to these models during the first year after each model's release.
We report that these models have been globally exposed to ~4.7M samples from 263 benchmarks.
arXiv Detail & Related papers (2024-02-06T11:54:23Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)