Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in
Closed-Source LLMs
- URL: http://arxiv.org/abs/2402.03927v2
- Date: Thu, 22 Feb 2024 12:32:24 GMT
- Title: Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in
Closed-Source LLMs
- Authors: Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek
- Abstract summary: We conduct the first systematic analysis of work using OpenAI's GPT-3.5 and GPT-4.
We document the amount of data leaked to these models during the first year after the model's release.
We report that these models have been globally exposed to $\sim$4.7M samples from 263 benchmarks.
- Score: 5.310555620116225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Processing (NLP) research is increasingly focusing on the
use of Large Language Models (LLMs), with some of the most popular ones being
either fully or partially closed-source. The lack of access to model details,
especially regarding training data, has repeatedly raised concerns about data
contamination among researchers. Several attempts have been made to address
this issue, but they are limited to anecdotal evidence and trial and error.
Additionally, they overlook the problem of \emph{indirect} data leaking, where
models are iteratively improved by using data coming from users. In this work,
we conduct the first systematic analysis of work using OpenAI's GPT-3.5 and
GPT-4, the most prominently used LLMs today, in the context of data
contamination. By analysing 255 papers and considering OpenAI's data usage
policy, we extensively document the amount of data leaked to these models
during the first year after the model's release. We report that these models
have been globally exposed to $\sim$4.7M samples from 263 benchmarks. At the
same time, we document a number of evaluation malpractices emerging in the
reviewed papers, such as unfair or missing baseline comparisons and
reproducibility issues. We release our results as a collaborative project on
https://leak-llm.github.io/, where other researchers can contribute to our
efforts.
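The headline figure ($\sim$4.7M samples across 263 benchmarks) comes from aggregating per-paper annotations of which benchmark data was sent to OpenAI's models. A minimal sketch of such a tally is shown below; the column names and numbers are hypothetical placeholders, not the project's actual annotation schema or results.

```python
# Hypothetical tally of per-paper leakage annotations (illustration only;
# the real annotations live in the collaborative project at leak-llm.github.io).
import pandas as pd

# Assumed schema: one row per (paper, benchmark) submission to the API.
annotations = pd.DataFrame([
    {"paper": "paper-001", "benchmark": "GSM8K", "samples_leaked": 1319},
    {"paper": "paper-002", "benchmark": "GSM8K", "samples_leaked": 1319},
    {"paper": "paper-003", "benchmark": "WMT22", "samples_leaked": 2037},
])

total_exposed = annotations["samples_leaked"].sum()  # total samples sent to the models
n_benchmarks = annotations["benchmark"].nunique()    # distinct benchmarks affected

print(f"{total_exposed} samples exposed across {n_benchmarks} benchmarks")
```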
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method works under black-box conditions, without access to model training data or weights (a generic probe in this spirit is sketched after this list).
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library [68.10605098856087]
Large Language Models (LLMs) are increasingly being used in business applications and fundraising in AI.
LLMs' measured performance may no longer be reliable, as high scores may be at least partly due to previous exposure to the evaluation data.
We release an open-source Python library named LLMSanitize implementing major contamination detection algorithms.
arXiv Detail & Related papers (2024-03-31T14:32:02Z)
- Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models [1.443696537295348]
Privacy leakage and copyright violation in large language models remain underexplored.
Our unlearning algorithms are not only data-agnostic/model-agnostic but also proven to be robust in terms of utility preservation or privacy guarantee.
arXiv Detail & Related papers (2024-03-13T18:57:30Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities.
Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included.
We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
- Data Contamination Through the Lens of Time [21.933771085956426]
Claims about large language models (LLMs) are often supported by evaluation on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model (a minimal density-ratio sketch appears after this list).
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation exists in some cases, while in others the models memorize the contaminated data but do not exploit it.
arXiv Detail & Related papers (2022-03-15T20:37:16Z)
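For the entry "Training on the Benchmark Is Not All You Need" above: the paper's detector works from the contents of multiple-choice options under black-box access. Its exact statistic is not reproduced here; the sketch below only illustrates one heuristic in that spirit, checking how often a model's answer changes when the option contents are shuffled (a model that has memorised the canonical ordering or label tends to be unusually order-sensitive). The `query_model` callable is a hypothetical stand-in for whatever API is being probed.

```python
# Generic black-box order-sensitivity probe (illustration only, not the
# detection statistic of the paper above): shuffle the option contents and
# count how often the model's chosen answer changes.
import random
import string
from typing import Callable, Sequence


def order_sensitivity(
    question: str,
    options: Sequence[str],
    query_model: Callable[[str], str],  # hypothetical: prompt -> chosen option text
    n_shuffles: int = 8,
    seed: int = 0,
) -> float:
    """Fraction of shuffled orderings on which the model's answer changes."""
    rng = random.Random(seed)

    def ask(opts: Sequence[str]) -> str:
        labels = string.ascii_uppercase[: len(opts)]
        prompt = question + "\n" + "\n".join(
            f"{label}. {opt}" for label, opt in zip(labels, opts)
        )
        return query_model(prompt)

    baseline = ask(list(options))
    flips = sum(
        1
        for _ in range(n_shuffles)
        if ask(rng.sample(list(options), len(options))) != baseline
    )
    return flips / n_shuffles
```

High order-sensitivity is only a weak signal on its own; the paper evaluates its actual detector on 31 mainstream open-source LLMs across four benchmarks.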
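For the DOMIAS entry above: the reported idea is to infer membership by targeting local overfitting of the generative model, i.e. regions where synthetic data is much denser than the underlying data distribution would suggest. Below is a minimal density-ratio sketch assuming simple kernel density estimators; the bandwidth and the thresholding rule are illustrative assumptions, not the paper's configuration.

```python
# Density-ratio membership scores in the spirit of DOMIAS (sketch only):
# candidates around which the synthetic data is locally much denser than a
# reference sample are more likely to have been training members.
import numpy as np
from sklearn.neighbors import KernelDensity


def membership_scores(
    synthetic: np.ndarray,   # samples drawn from the generative model
    reference: np.ndarray,   # samples from the underlying data distribution
    candidates: np.ndarray,  # records whose membership is being inferred
    bandwidth: float = 0.5,  # illustrative, untuned choice
) -> np.ndarray:
    """Return log p_synthetic(x) - log p_reference(x) for each candidate row."""
    kde_syn = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    return kde_syn.score_samples(candidates) - kde_ref.score_samples(candidates)


# Usage sketch: flag candidates whose score exceeds a threshold calibrated on
# known non-members (e.g. their median score).
```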