CoderEval: A Benchmark of Pragmatic Code Generation with Generative
Pre-trained Models
- URL: http://arxiv.org/abs/2302.00288v3
- Date: Fri, 23 Feb 2024 08:29:16 GMT
- Title: CoderEval: A Benchmark of Pragmatic Code Generation with Generative
Pre-trained Models
- Authors: Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai
Liang, Ying Li, Qianxiang Wang, Tao Xie
- Abstract summary: We propose a benchmark named CoderEval, consisting of 230 Python and 230 Java code generation tasks.
By evaluating three code generation models on CoderEval, we find that the effectiveness of these models in generating standalone functions is substantially higher than that in generating non-standalone functions.
- Score: 20.169432642273524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code generation models based on the pre-training and fine-tuning paradigm
have been increasingly explored by both academia and industry, resulting in
well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To
evaluate the effectiveness of these models, multiple benchmarks have been
proposed, but they include only cases of generating a standalone function,
i.e., a function that may invoke or access only built-in functions and
standard libraries. However, non-standalone functions, which are typically not
included in existing benchmarks, constitute more than 70% of the functions in
popular open-source projects, so evaluating models only on standalone
functions cannot reflect their effectiveness in pragmatic code generation
scenarios.
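To make the standalone/non-standalone distinction concrete, here is a minimal,
hypothetical Python illustration; the names Order and TAX_RATE are invented for
this sketch and are not drawn from CoderEval. The first function depends only on
the standard library, while the second cannot be generated correctly without
knowing project-level context.

    import math

    def circle_area(radius: float) -> float:
        # Standalone: touches only built-ins and the standard library.
        return math.pi * radius ** 2

    # Hypothetical project context that lives outside the target function.
    TAX_RATE = 0.07                      # module-level constant

    class Order:                         # project-defined type
        def __init__(self, items):
            self.items = items           # list of (name, price) tuples

    def order_total(order: Order) -> float:
        # Non-standalone: relies on the Order class and TAX_RATE, both of
        # which a model must know about to generate this function correctly.
        subtotal = sum(price for _, price in order.items)
        return subtotal * (1 + TAX_RATE)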
To help bridge the preceding gap, in this paper, we propose a benchmark named
CoderEval, consisting of 230 Python and 230 Java code generation tasks
carefully curated from popular real-world open-source projects and a
self-contained execution platform to automatically assess the functional
correctness of generated code. CoderEval supports code generation tasks from
six levels of context dependency, where context refers to code elements such as
types, APIs, variables, and constants defined outside the function under
generation but within the dependent third-party libraries, current class, file,
or project. CoderEval can be used to evaluate the effectiveness of models in
generating code beyond only standalone functions. By evaluating three code
generation models on CoderEval, we find that the effectiveness of these models
in generating standalone functions is substantially higher than that in
generating non-standalone functions. Our analysis highlights the current
progress and pinpoints future directions to further improve a model's
effectiveness by leveraging contextual information for pragmatic code
generation.
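As a rough illustration of the kind of execution-based check such a platform
performs (a simplified sketch, not CoderEval's actual implementation, which
additionally handles project dependencies, sandboxing, and timeouts), a
generated Python candidate can be run in a fresh namespace together with the
task's tests:

    def passes_tests(generated_code: str, test_code: str) -> bool:
        # Execute the candidate and its tests in an isolated namespace;
        # any exception (including a failed assert) counts as incorrect.
        namespace = {}
        try:
            exec(generated_code, namespace)   # define the generated function
            exec(test_code, namespace)        # run the task's test assertions
            return True
        except Exception:
            return False

    # Hypothetical usage with an invented task:
    candidate = "def add(a, b):\n    return a + b"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    print(passes_tests(candidate, tests))     # True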
Related papers
- OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique [59.18475981916166]
We introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions).
In this work, we employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique.
Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance.
arXiv Detail & Related papers (2025-07-11T23:35:54Z) - An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities [19.455889970335967]
Code generation aims to automatically produce code snippets in a specific programming language from natural language descriptions.
One main challenge for pre-trained code generation models is the semantic gap between natural language requirements and source code.
A retrieval-augmented framework can be leveraged to help understand the requirements and guide the generation process (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-01-23T15:17:51Z) - See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI [0.0]
This paper introduces the See-Saw generative mechanism, a novel methodology for dynamic and iterative code generation.
The proposed approach alternates between main code updates and dependency generation to ensure alignment and functionality.
The mechanism ensures that all code components are synchronized and functional, enabling scalable and efficient project generation.
arXiv Detail & Related papers (2024-11-16T18:54:56Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark for evaluating code completion models, constructed from real-world Python and TypeScript repositories.
To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases (a toy version of this check appears after this list).
Our empirical evaluation of 6 state-of-the-art models shows that test augmentation is critical to improving the accuracy of the benchmark.
arXiv Detail & Related papers (2024-08-07T03:06:57Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present a novel benchmark designed to evaluate repository-level code generation.
We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees (a toy AST-based check appears after this list).
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - Execution-based Code Generation using Deep Reinforcement Learning [8.085533911328577]
PPOCoder is a new framework for code generation that combines pre-trained PL models with Proximal Policy Optimization.
PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process.
Notably, PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs.
arXiv Detail & Related papers (2023-01-31T18:02:26Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Incorporating Domain Knowledge through Task Augmentation for Front-End
JavaScript Code Generation [10.75138604869187]
In some domain-specific scenarios, building a large paired corpus for code generation is difficult because there is no directly available pairing data.
We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model.
Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
arXiv Detail & Related papers (2022-08-22T06:57:51Z)
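The retrieval-augmented code generation idea summarized above ("An Empirical
Study of Retrieval-Augmented Code Generation") can be sketched as follows; the
similarity measure, example corpus, and prompt format are placeholder choices
for illustration, not the paper's actual components.

    from difflib import SequenceMatcher

    def retrieve(query, corpus, k=2):
        # Rank (description, code) pairs by crude textual similarity to the query.
        ranked = sorted(
            corpus,
            key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
            reverse=True,
        )
        return ranked[:k]

    def build_prompt(requirement, corpus):
        # Prepend the retrieved examples to the requirement; the resulting
        # prompt is then fed to any code generation model (omitted here).
        examples = retrieve(requirement, corpus)
        context = "\n\n".join(f"# {desc}\n{code}" for desc, code in examples)
        return f"{context}\n\n# {requirement}\n"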
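RepoMasterEval's use of mutation testing to gauge how well a task's tests
discriminate correct from incorrect code can be illustrated with a toy example;
the mutants here are hand-written operator swaps, whereas a real tool generates
them automatically and uses a far more robust harness.

    def run_tests(code, tests):
        # Any exception, including a failed assert, counts as a failing run.
        namespace = {}
        try:
            exec(code, namespace)
            exec(tests, namespace)
            return True
        except Exception:
            return False

    reference = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))"
    tests = "assert clamp(5, 0, 3) == 3\nassert clamp(-1, 0, 3) == 0"

    mutants = [
        "def clamp(x, lo, hi):\n    return min(lo, min(x, hi))",  # max -> min
        "def clamp(x, lo, hi):\n    return max(lo, max(x, hi))",  # min -> max
    ]
    killed = sum(not run_tests(m, tests) for m in mutants)
    print(f"mutation score: {killed}/{len(mutants)}")  # stronger tests kill more mutants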
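The AST-based static evaluation summarized above ("A Static Evaluation of Code
Completion by Large Language Models") can be approximated with a toy checker;
the heuristic below only catches parse failures and calls to names that are
neither defined, imported, nor built in, and it both misses and over-reports
cases that the paper's framework handles properly.

    import ast
    import builtins

    def static_errors(completion: str) -> list:
        # Parse the completion; a SyntaxError is reported directly.
        try:
            tree = ast.parse(completion)
        except SyntaxError as exc:
            return [f"syntax error on line {exc.lineno}: {exc.msg}"]
        # Collect names the snippet itself defines, imports, or assigns.
        known = set(dir(builtins))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                known.add(node.name)
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                known.update(a.asname or a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                known.add(node.id)
        # Flag calls to names that are not known within the snippet.
        return [
            f"possibly undefined name: {node.func.id}"
            for node in ast.walk(tree)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id not in known
        ]

    print(static_errors("def f(x):\n    return helper(x) + 1"))
    # -> ['possibly undefined name: helper']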