DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- URL: http://arxiv.org/abs/2405.19856v1
- Date: Thu, 30 May 2024 09:03:42 GMT
- Title: DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
- Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li
- Abstract summary: Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
- Score: 83.5195424237358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs' coding abilities in real-world code repositories. For example, in our experiments, the highest Pass@1 of gpt-4-turbo is only 53.04%. We also analyze LLMs' failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs' predictions have been released.
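For context on the reported numbers: Pass@1 is the standard execution-based metric from the HumanEval line of work. The sketch below is a minimal illustration of the commonly used unbiased pass@k estimator, assuming DevEval's Pass@1 is computed in this execution-based way; the per-problem record format (`n_samples`, `n_correct`) and the toy data are hypothetical and not taken from the DevEval release.
```python
# Minimal sketch of the unbiased pass@k estimator (Chen et al., 2021).
# The record schema below is an assumption for illustration, not the
# actual format of DevEval's released predictions.
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes,
    given n generated samples of which c pass the tests."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset passes.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[dict], k: int = 1) -> float:
    """Average pass@k over all benchmark problems."""
    scores = [pass_at_k(r["n_samples"], r["n_correct"], k) for r in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: 3 problems, 20 generated samples each.
    toy = [
        {"n_samples": 20, "n_correct": 12},
        {"n_samples": 20, "n_correct": 0},
        {"n_samples": 20, "n_correct": 20},
    ]
    print(f"pass@1 = {benchmark_pass_at_k(toy, k=1):.4f}")
```
With a single sample per problem (n = k = 1), the estimator reduces to the fraction of problems whose generated code passes the repository's tests, which is how a figure such as the 53.04% Pass@1 for gpt-4-turbo would typically be read.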
Related papers
- EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations [87.34429475432998]
Existing benchmarks have two limitations: data leakage and a lack of domain-specific evaluation.
EvoCodeBench will be dynamically updated at regular intervals (e.g., every 6 months) to avoid data leakage.
This paper releases the first version, EvoCodeBench-2403, containing 275 samples from 25 repositories.
arXiv Detail & Related papers (2024-10-30T08:57:59Z)
- CodeJudge: Evaluating Code Generation with Large Language Models [6.867043179943195]
Large Language Models (LLMs) have shown promising performance in code generation.
How to reliably evaluate code generated by LLMs remains an unresolved problem.
This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases.
arXiv Detail & Related papers (2024-10-03T03:58:03Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories [42.257427142180546]
Existing benchmarks demonstrate poor alignment with real-world code repositories.
EvoCodeBench is an evolving benchmark designed to avoid data leakage.
Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular Large Language Models.
arXiv Detail & Related papers (2024-03-31T08:10:50Z)
- DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z)
- Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation [8.575560293086289]
Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code.
The misuse of APIs in the generated code could lead to severe problems, such as resource leaks and program crashes.
arXiv Detail & Related papers (2023-08-20T18:36:28Z)