ReCode: Robustness Evaluation of Code Generation Models
- URL: http://arxiv.org/abs/2212.10264v1
- Date: Tue, 20 Dec 2022 14:11:31 GMT
- Title: ReCode: Robustness Evaluation of Code Generation Models
- Authors: Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang,
Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia,
Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang
- Abstract summary: We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
- Score: 90.10436771217243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation models have achieved impressive performance. However, they
tend to be brittle as slight edits to a prompt could lead to very different
generations; these robustness properties, critical for user experience when
deployed in real-life applications, are not well understood. Most existing
works on robustness in text or code tasks have focused on classification, while
robustness in generation tasks is an uncharted area and to date there is no
comprehensive benchmark for robustness in code generation. In this paper, we
propose ReCode, a comprehensive robustness evaluation benchmark for code
generation models. We customize over 30 transformations specifically for code
on docstrings, function and variable names, code syntax, and code format. They
are carefully designed to be natural in real-life coding practice, preserve the
original semantic meaning, and thus provide multifaceted assessments of a
model's robustness performance. With human annotators, we verified that over
90% of the perturbed prompts do not alter the semantic meaning of the original
prompt. In addition, we define robustness metrics for code generation models
considering the worst-case behavior under each type of perturbation, taking
advantage of the fact that executing the generated code can serve as objective
evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well
as function completion tasks derived from them. Interesting observations
include: better robustness for CodeGen over InCoder and GPT-J; models are most
sensitive to syntax perturbations; more challenging robustness evaluation on
MBPP over HumanEval.
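To make the worst-case robustness metrics concrete, below is a minimal Python sketch. It is not ReCode's actual implementation: the transformation, the metric function, and the pass/fail outcomes are hypothetical illustrations. The benchmark itself defines over 30 transformations and execution-based metrics; the sketch shows only a toy identifier rename and a worst-case pass@1-style aggregation in which a task counts as robustly solved only if the completions for the original prompt and all of its perturbed variants pass the unit tests.
```python
import re


def rename_variable(prompt: str, old: str, new: str) -> str:
    """Toy prompt perturbation: rename an identifier, keeping semantics intact.

    Stand-in for ReCode's transformations on docstrings, function/variable
    names, syntax, and format; only meant to illustrate the idea.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, prompt)


def robust_pass_at_1(results: dict) -> float:
    """Worst-case aggregation in the spirit of ReCode's robustness metrics.

    results[task][variant] is True iff the completion generated for that
    prompt variant (original or perturbed) passes the task's unit tests.
    A task is robustly solved only if every variant passes.
    """
    solved = sum(all(outcomes.values()) for outcomes in results.values())
    return solved / len(results)


if __name__ == "__main__":
    prompt = 'def add(nums):\n    """Return the sum of nums."""\n'
    print(rename_variable(prompt, "nums", "numbers"))

    # Hypothetical execution outcomes: two tasks, each evaluated on the
    # original prompt and two perturbed variants.
    results = {
        "HumanEval/0": {"original": True, "docstring": True, "rename": True},
        "HumanEval/1": {"original": True, "docstring": False, "rename": True},
    }
    print(robust_pass_at_1(results))  # 0.5 -- only HumanEval/0 survives all perturbations
```
Under this worst-case view, a single failing perturbation (the docstring variant of the second task above) is enough to mark the whole task as not robustly solved, which is what makes the metric stricter than standard pass@1.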
Related papers
- GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models [16.6780665807022]
GitChameleon is a novel, manually curated dataset comprising 116 Python code completion problems.
GitChameleon is designed to rigorously assess the ability of modern large language models to generate version-specific code.
arXiv Detail & Related papers (2024-11-05T23:34:06Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy.
Results indicate that MANGO significantly improves the code pass rate based on the strong baselines.
The robustness of the logical comment decoding strategy is notably higher than the Chain-of-thoughts prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z)
- JumpCoder: Go Beyond Autoregressive Coder via Online Modification [18.9350072969148]
We introduce JumpCoder, a novel model-agnostic framework that enables human-like online modification and non-sequential generation to augment code LLMs.
The key idea behind JumpCoder is to insert new code into the currently generated code when necessary during generation, which is achieved through an auxiliary infilling model.
arXiv Detail & Related papers (2024-01-15T18:04:29Z)
- Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code.
This issue is also observed in language modeling for long text generation.
In this study, we investigate whether techniques developed to improve coherence in long text generation can be applied to code generation.
arXiv Detail & Related papers (2023-04-14T00:01:05Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)