Rethinking Benchmark and Contamination for Language Models with
Rephrased Samples
- URL: http://arxiv.org/abs/2311.04850v2
- Date: Sat, 11 Nov 2023 05:11:18 GMT
- Title: Rethinking Benchmark and Contamination for Language Models with
Rephrased Samples
- Authors: Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion
Stoica
- Abstract summary: Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
- Score: 49.18977581962162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are increasingly trained on all the data ever produced
by humans. Many have raised concerns about the trustworthiness of public
benchmarks due to potential contamination in pre-training or fine-tuning
datasets. While most data decontamination efforts apply string matching (e.g.,
n-gram overlap) to remove benchmark data, we show that these methods are
insufficient, and simple variations of test data (e.g., paraphrasing,
translation) can easily bypass these decontamination measures. Furthermore, we
demonstrate that if such variations of test data are not eliminated, a 13B model
can easily overfit a test benchmark and achieve drastically inflated performance,
on par with GPT-4. We validate these observations on widely used benchmarks such
as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a
stronger LLM-based decontamination method and apply it to widely used
pre-training and fine-tuning datasets, revealing significant previously unknown
test overlap. For example, in pre-training sets such as RedPajama-Data-1T and
StarCoder-Data, we identified overlap with 8-18% of the HumanEval benchmark.
Interestingly, we also find such contamination in synthetic datasets generated
by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We
urge the community to adopt stronger decontamination approaches when using
public benchmarks. Moreover, we call for the community to actively develop
fresh one-time exams to evaluate models accurately. Our decontamination tool is
publicly available at https://github.com/lm-sys/llm-decontaminator.
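The abstract contrasts string-matching decontamination (e.g., n-gram overlap) with a stronger, meaning-aware LLM-based check. The following is a minimal sketch, not the paper's implementation, of why a fixed n-gram test catches verbatim duplicates but misses a light rephrase of the same test item; the naive whitespace tokenizer, the n = 13 window, the GSM8K-style example question, and the `ask_llm` placeholder are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): why a fixed n-gram overlap test
# catches verbatim benchmark copies but misses rephrased ones. The whitespace
# tokenizer, the n = 13 window, and the example question are illustrative choices.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Set of word-level n-grams in `text` (naive lowercase whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(train_doc: str, test_case: str, n: int = 13) -> bool:
    """Flag a training document as contaminated if it shares any n-gram with a test case."""
    return bool(ngrams(train_doc, n) & ngrams(test_case, n))

def llm_judge_rephrase(train_doc: str, test_case: str, ask_llm) -> bool:
    """Meaning-aware check in the spirit of an LLM-based decontaminator:
    ask a strong model whether the two texts are rephrasings of the same problem.
    `ask_llm` is a placeholder for any chat-completion call, not a real API here."""
    prompt = (
        "Do these two texts describe the same problem, up to rephrasing or "
        f"translation? Answer yes or no.\n\nText A: {train_doc}\n\nText B: {test_case}"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

benchmark_item = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?"
)
verbatim_copy = benchmark_item
rephrased_copy = (
    "Making a robe needs 2 bolts of blue fiber plus half as much white fiber. "
    "What is the total number of bolts required?"
)

print(ngram_overlap(verbatim_copy, benchmark_item))   # True: exact duplicate is caught
print(ngram_overlap(rephrased_copy, benchmark_item))  # False: the rephrased sample slips through
```

The tool linked in the abstract is the paper's actual decontaminator; the `llm_judge_rephrase` stub above only gestures at the idea of asking a strong model whether two items are rephrasings of each other.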
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to unfairly inflated scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, a Paired Confidence Significance Testing to effectively detect benchmark contamination in LLMs.
arXiv Detail & Related papers (2024-06-26T13:12:40Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- Evading Data Contamination Detection for Language Models is (too) Easy [9.024665800235855]
The data used to train large language models can inadvertently become contaminated with public benchmarks.
We propose a categorization of both model providers and contamination detection methods.
This reveals vulnerabilities in existing methods that we exploit with EAL.
arXiv Detail & Related papers (2024-02-05T09:10:32Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST^EM scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark [19.875954121100005]
We argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble.
The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark.
This position paper defines different levels of data contamination and argues for a community effort.
arXiv Detail & Related papers (2023-10-27T09:48:29Z)
- Proving Test Set Contamination in Black Box Language Models [20.576866080360247]
We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights.
Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely (a minimal sketch of this ordering test appears after this list).
arXiv Detail & Related papers (2023-10-26T17:43:13Z)
- Detecting Pretraining Data from Large Language Models [90.12037980837738]
We study the pretraining data detection problem.
Given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text?
We introduce a new detection method, Min-K% Prob, based on a simple hypothesis about low-probability tokens (a sketch of the idea appears after this list).
arXiv Detail & Related papers (2023-10-25T17:21:23Z)
- Data Contamination Through the Lens of Time [21.933771085956426]
Claims about the capabilities of large language models (LLMs) are often supported by evaluation on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
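The exchangeability idea behind "Proving Test Set Contamination in Black Box Language Models" above can be sketched as a permutation test: if the model has never seen the benchmark, its canonical ordering should be no more likely than a random shuffle. The sketch below is a hedged approximation, not that paper's exact procedure; `sequence_logprob` is a placeholder for any API that returns a text's log-likelihood, and the separator and shuffle count are arbitrary illustrative choices.

```python
# Hedged sketch of an exchangeability-based contamination test (approximation, not the
# paper's exact procedure): under no contamination, the canonical ordering of an
# exchangeable benchmark should be no more likely under the model than a random shuffle,
# so the rank of its log-likelihood among shuffled orderings yields a permutation p-value.
# `sequence_logprob` is a placeholder for any text-scoring API.
import random

def ordering_logprob(examples: list[str], sequence_logprob) -> float:
    """Log-likelihood the model assigns to the benchmark examples in a given order."""
    return sequence_logprob("\n\n".join(examples))

def contamination_p_value(benchmark: list[str], sequence_logprob, num_shuffles: int = 99) -> float:
    """Permutation test: fraction of orderings (shuffles plus the canonical one) that score
    at least as high as the canonical order. A small p-value suggests the model has seen
    the benchmark in its canonical order, i.e., contamination."""
    canonical = ordering_logprob(benchmark, sequence_logprob)
    at_least_as_high = 1  # count the canonical ordering itself
    for _ in range(num_shuffles):
        shuffled = benchmark[:]
        random.shuffle(shuffled)
        if ordering_logprob(shuffled, sequence_logprob) >= canonical:
            at_least_as_high += 1
    return at_least_as_high / (num_shuffles + 1)
```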
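The Min-K% Prob method from "Detecting Pretraining Data from Large Language Models" above rests on the observation that text seen during pretraining tends to contain fewer very-low-probability tokens. The sketch below is an approximation of that idea, not the authors' code; `token_logprobs` stands in for any API returning per-token log-probabilities, and k = 20 and the decision threshold are illustrative values that would need tuning.

```python
# Hedged sketch of the Min-K% Prob idea (approximation, not the authors' code): member
# text tends to have a higher average log-probability over its k% least likely tokens.
# `token_logprobs` is a placeholder for any per-token log-probability API.

def min_k_percent_prob(text: str, token_logprobs, k: float = 20.0) -> float:
    """Average log-probability of the k% lowest-probability tokens in `text`."""
    logps = sorted(token_logprobs(text))         # ascending: least likely tokens first
    n_low = max(1, int(len(logps) * k / 100.0))  # size of the bottom-k% slice
    bottom = logps[:n_low]
    return sum(bottom) / len(bottom)

def likely_in_pretraining(text: str, token_logprobs, threshold: float = -3.0) -> bool:
    """Classify `text` as likely seen during pretraining if its Min-K% score clears a tuned threshold."""
    return min_k_percent_prob(text, token_logprobs) > threshold
```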