Quantifying Contamination in Evaluating Code Generation Capabilities of
Language Models
- URL: http://arxiv.org/abs/2403.04811v1
- Date: Wed, 6 Mar 2024 21:45:35 GMT
- Title: Quantifying Contamination in Evaluating Code Generation Capabilities of
Language Models
- Authors: Martin Riddell, Ansong Ni, Arman Cohan
- Abstract summary: Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks where similar solutions are seen during training.
- Score: 27.24738197172374
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While large language models have achieved remarkable performance on various
code generation benchmarks, there have been growing concerns regarding
potential contamination of these benchmarks as they may be leaked into
pretraining and finetuning data. While recent work has investigated
contamination in natural language generation and understanding tasks, there has
been less extensive research into how data contamination impacts the evaluation
of code generation, which is critical for understanding the robustness and
reliability of LLMs in programming contexts. In this work, we perform a
comprehensive study of data contamination of popular code generation
benchmarks, and precisely quantify their overlap with pretraining corpora
through both surface-level and semantic-level matching. In our experiments, we
show that there is substantial overlap between popular code generation
benchmarks and open training corpora, and that models perform significantly
better on the subset of the benchmarks where similar solutions are seen during
training. We also conduct extensive analysis on the factors that affect model
memorization and generalization, such as model size, problem difficulty, and
question length. We release all resulting files from our matching pipeline for
future research.
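The released files do not appear on this page, so the following is a minimal, hypothetical sketch of what surface-level matching can look like: it scores how many of a benchmark solution's token n-grams reappear in a training-corpus snippet. The tokenizer, the n-gram length, and the example strings are illustrative assumptions rather than the authors' actual pipeline; semantic-level matching would typically replace the n-gram sets with embeddings from a code encoder and compare them by cosine similarity.

# Hypothetical illustration (not the paper's released pipeline): surface-level
# overlap between a benchmark solution and a training-corpus snippet.
import re


def tokenize(code: str) -> list[str]:
    # Crude lexical tokenizer: identifiers, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", code)


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def surface_overlap(solution: str, snippet: str, n: int = 4) -> float:
    # Fraction of the solution's token n-grams that also occur in the snippet;
    # 1.0 means every n-gram of the solution is covered by the corpus snippet.
    sol = ngrams(tokenize(solution), n)
    if not sol:
        return 0.0
    return len(sol & ngrams(tokenize(snippet), n)) / len(sol)


if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b"
    snippet = "def add(a, b):\n    return a + b  # seen in a crawled repository"
    print(f"surface-level overlap: {surface_overlap(solution, snippet):.2f}")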
Related papers
- A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks [0.0]
We evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation.
Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase.
These initial insights highlight the need for more comprehensive empirical studies.
arXiv Detail & Related papers (2024-11-23T06:40:47Z)
- Are Large Language Models Memorizing Bug Benchmarks? [6.640077652362016]
Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair.
A growing concern within the software engineering community is that benchmarks may not reliably reflect true LLM performance due to the risk of data leakage.
We systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks.
arXiv Detail & Related papers (2024-11-20T13:46:04Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated compared to canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on benchmarks.
We reveal substantial instances of training and even test set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
- Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Data Contamination Through the Lens of Time [21.933771085956426]
Claims about the abilities of large language models (LLMs) are often supported by evaluating them on publicly available benchmarks.
This practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data.
We conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models.
arXiv Detail & Related papers (2023-10-16T17:51:29Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
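Confidence calibration is mentioned but not defined here; one common way to measure it is expected calibration error (ECE), sketched below under the assumption that each generated program comes with a model confidence score and a pass/fail execution result. The function name, binning scheme, and toy values are illustrative, not taken from that paper.

# Illustrative expected calibration error (ECE): gap between a model's stated
# confidence and its actual pass rate, averaged over equal-width confidence bins.
def expected_calibration_error(confidences: list[float],
                               passed: list[bool],
                               n_bins: int = 10) -> float:
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # The last bin is closed on the right so that confidence 1.0 is counted.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        pass_rate = sum(passed[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - pass_rate)
    return ece


if __name__ == "__main__":
    # Toy example: four generated programs with confidences and test outcomes.
    print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [True, False, True, False]))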
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three software engineering (SE) tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)