On Leakage of Code Generation Evaluation Datasets
- URL: http://arxiv.org/abs/2407.07565v3
- Date: Thu, 3 Oct 2024 16:48:55 GMT
- Title: On Leakage of Code Generation Evaluation Datasets
- Authors: Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé
- Abstract summary: We consider contamination by code generation test sets, in particular in their use in modern large language models.
To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions.
- Score: 44.4726918027046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp .
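For readers who want to try the benchmark, the sketch below shows one plausible way to load LBPP with the Hugging Face `datasets` library. The dataset ID comes from the URL above; the available configurations, splits, and column names are not specified here and should be checked against the dataset card.
```python
# Minimal sketch of pulling the LBPP benchmark released with this paper.
# Assumes the standard Hugging Face `datasets` API; the dataset ID is taken
# from the abstract's URL, but the configs, splits, and column names are
# assumptions -- verify them against the dataset card before relying on them.
from datasets import load_dataset

lbpp = load_dataset("CohereForAI/lbpp")

# Inspect what the download actually contains before wiring it into an
# evaluation harness: split names, column names, and number of prompts.
for split_name, split in lbpp.items():
    print(split_name, split.column_names, f"{len(split)} examples")
```
Since the abstract reports 161 prompts, the printed example counts give a quick sanity check that the download matches the released benchmark.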
Related papers
- Leveraging Large Language Models in Code Question Answering: Baselines and Issues [0.1617522438111378]
This paper presents work on using large language models for question answering over Python source code.
The proposed method for implementing a source code question answering system involves fine-tuning a large language model on a unified dataset of questions and answers for Python code.
We report BLEU-4, BERTScore F1, BLEURT, and Exact Match metric values, along with the conclusions from the manual error analysis.
arXiv Detail & Related papers (2024-11-05T11:25:12Z)
- CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow [10.19019476978683]
The dataset provides examples that include a clarified intent, associated code snippets, and an average of three related unit tests.
Comprising 3,409 examples crafted by Python experts, the dataset is designed for both model fine-tuning and standalone evaluation.
arXiv Detail & Related papers (2024-09-25T11:18:52Z)
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish [47.3916421056009]
Large Language Models (LLMs) are trained on massive web-crawled corpora.
LLMs produce leaked information in most cases, even when little such data appears in their training set.
The self-detection method showed superior performance compared to existing detection methods.
arXiv Detail & Related papers (2024-03-24T13:21:58Z)
- LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction [21.553915781660905]
LatestEval is an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations.
It avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models.
Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks.
arXiv Detail & Related papers (2023-12-19T17:16:43Z)
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [49.18977581962162]
Large language models are increasingly trained on all the data ever produced by humans.
Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets.
arXiv Detail & Related papers (2023-11-08T17:35:20Z)
- Towards Mitigating more Challenging Spurious Correlations: A Benchmark & New Datasets [43.64631697043496]
Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels.
Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation.
We present SpuCo, a Python package with modular implementations of state-of-the-art solutions, enabling easy and reproducible evaluation.
arXiv Detail & Related papers (2023-06-21T00:59:06Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.