DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- URL: http://arxiv.org/abs/2405.19856v1
- Date: Thu, 30 May 2024 09:03:42 GMT
- Title: DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li,
- Abstract summary: Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
- Score: 83.5195424237358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.
Related papers
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark [5.641402231731082]
We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale.
RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories [42.257427142180546]
Existing benchmarks demonstrate poor alignment with real-world code repositories.
EvoCodeBench is an evolving benchmark to avoid data leakage.
Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular Large Language Models.
arXiv Detail & Related papers (2024-03-31T08:10:50Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - StarCoder 2 and The Stack v2: The Next Generation [105.93298676368798]
We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens.
We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.
Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size.
arXiv Detail & Related papers (2024-02-29T13:53:35Z) - DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z) - A Review of Repository Level Prompting for LLMs [0.0]
Large Language Models (LLMs) have led to notable successes, such as achieving a 94.6% solve rate on the HumanEval benchmark.
There is an increasing commercial push for repository-level inline code completion tools, such as GitHub Copilot and Tab Nine.
This paper delves into the transition from individual coding problems to repository-scale solutions.
arXiv Detail & Related papers (2023-12-15T00:34:52Z) - Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability
of Large Language Model Code Generation [8.575560293086289]
Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code.
The misuse of APIs in the generated code could lead to severe problem, such as resource leaks, program crashes.
arXiv Detail & Related papers (2023-08-20T18:36:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.