On Leakage of Code Generation Evaluation Datasets
- URL: http://arxiv.org/abs/2407.07565v3
- Date: Thu, 3 Oct 2024 16:48:55 GMT
- Title: On Leakage of Code Generation Evaluation Datasets
- Authors: Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé
- Abstract summary: We consider contamination by code generation test sets, in particular in their use in modern large language models.
To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions.
- Score: 44.4726918027046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp .
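For readers who want to try the benchmark, the sketch below shows one plausible way to load LBPP with the Hugging Face `datasets` library. The dataset ID comes from the URL above; the available configurations, splits, and column names are not specified here and should be checked against the dataset card.
```python
# Minimal sketch of pulling the LBPP benchmark released with this paper.
# Assumes the standard Hugging Face `datasets` API; the dataset ID is taken
# from the abstract's URL, but the configs, splits, and column names are
# assumptions -- verify them against the dataset card before relying on them.
from datasets import load_dataset

lbpp = load_dataset("CohereForAI/lbpp")

# Inspect what the download actually contains before wiring it into an
# evaluation harness: split names, column names, and number of prompts.
for split_name, split in lbpp.items():
    print(split_name, split.column_names, f"{len(split)} examples")
```
Since the abstract reports 161 prompts, the printed example counts give a quick sanity check that the download matches the released benchmark.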
Related papers
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents work on using large language models for question answering over Python source code.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
arXiv Detail & Related papers (2024-11-05T11:25:12Z)
- CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow [10.19019476978683]
The dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests.
Comprising 3,409 examples crafted by Python experts, the dataset is designed for both model fine-tuning and standalone evaluation.
arXiv Detail & Related papers (2024-09-25T11:18:52Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish [47.3916421056009]
Large Language Models (LLMs) are trained on massive web-crawled corpora.
LLMs produce leaked information in most cases, even when little such data appears in their training set.
The self-detection method showed superior performance compared to existing detection methods.
arXiv Detail & Related papers (2024-03-24T13:21:58Z)
- LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction [21.553915781660905]
LatestEval is an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations.
It avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models.
Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks.
arXiv Detail & Related papers (2023-12-19T17:16:43Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- Towards Mitigating more Challenging Spurious Correlations: A Benchmark & New Datasets [43.64631697043496]
Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels.
Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation.
We present SpuCo, a Python package with modular implementations of state-of-the-art solutions, enabling easy and reproducible evaluation.
arXiv Detail & Related papers (2023-06-21T00:59:06Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.