Quantifying Contamination in Evaluating Code Generation Capabilities of
Language Models
- URL: http://arxiv.org/abs/2403.04811v1
- Date: Wed, 6 Mar 2024 21:45:35 GMT
- Title: Quantifying Contamination in Evaluating Code Generation Capabilities of
Language Models
- Authors: Martin Riddell, Ansong Ni, Arman Cohan
- Abstract summary: Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there are substantial overlap between popular code generation benchmarks and open training corpus, and models perform significantly better on the subset of the benchmarks where similar solutions are seen during training.
- Score: 27.24738197172374
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While large language models have achieved remarkable performance on various
code generation benchmarks, there have been growing concerns regarding
potential contamination of these benchmarks as they may be leaked into
pretraining and finetuning data. While recent work has investigated
contamination in natural language generation and understanding tasks, there has
been less extensive research into how data contamination impacts the evaluation
of code generation, which is critical for understanding the robustness and
reliability of LLMs in programming contexts. In this work, we perform a
comprehensive study of data contamination of popular code generation
benchmarks, and precisely quantify their overlap with pretraining corpus
through both surface-level and semantic-level matching. In our experiments, we
show that there are substantial overlap between popular code generation
benchmarks and open training corpus, and models perform significantly better on
the subset of the benchmarks where similar solutions are seen during training.
We also conduct extensive analysis on the factors that affects model
memorization and generalization, such as model size, problem difficulty, and
question length. We release all resulting files from our matching pipeline for
future research.
Related papers
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on benchmark.
We reveal substantial instances of training even test set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Data Contamination Through the Lens of Time [21.933771085956426]
Large language models (LLMs) are often supported by evaluating publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z) - Benchmarking Causal Study to Interpret Large Language Models for Source
Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Evaluating the Robustness of Neural Language Models to Input
Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
arXiv Detail & Related papers (2021-08-27T12:31:17Z) - Code to Comment Translation: A Comparative Study on Model Effectiveness
& Errors [19.653423881863834]
Machine translation models are employed to "translate" code snippets into relevant natural language descriptions.
Most evaluations of such models are conducted using automatic reference-based metrics.
We compare three recently proposed source code summarization models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics.
Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy.
arXiv Detail & Related papers (2021-06-15T20:13:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.