CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
- URL: http://arxiv.org/abs/2404.00566v4
- Date: Wed, 02 Oct 2024 23:38:34 GMT
- Title: CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
- Authors: Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose,
- Abstract summary: We present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources.
Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples.
To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans.
- Score: 35.68087697258125
- License:
- Abstract: To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the Code- SearchNet dataset. To demonstrate the solvability of examples in Exec-CSN, we present a human study demonstrating that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z) - GenX: Mastering Code and Test Generation with Execution Feedback [7.225594526057816]
We propose a novel approach that concurrently trains a code generation model and a test generation model.
We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking.
The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.
arXiv Detail & Related papers (2024-12-18T03:18:21Z) - Commit0: Library Generation from Scratch [77.38414688148006]
Commit0 is a benchmark that challenges AI agents to write libraries from scratch.
Agents are provided with a specification document outlining the library's API as well as a suite of interactive unit tests.
Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate.
arXiv Detail & Related papers (2024-12-02T18:11:30Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [29.178248778212588]
ComplexCodeEval is a benchmark designed to assess large language models (LLMs) in various development tasks.
It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
arXiv Detail & Related papers (2024-09-16T13:43:04Z) - RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories.
To improve test accuracy of model generated code, we employ mutation testing to measure the effectiveness of the test cases.
Our empirical evaluation on 6 state-of-the-art models shows that test argumentation is critical in improving the accuracy of the benchmark.
arXiv Detail & Related papers (2024-08-07T03:06:57Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests.
We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z) - CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution.
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.