DevEval: Evaluating Code Generation in Practical Software Projects
- URL: http://arxiv.org/abs/2401.06401v4
- Date: Wed, 6 Mar 2024 02:16:51 GMT
- Title: DevEval: Evaluating Code Generation in Practical Software Projects
- Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Zhi Jin, Hao Zhu, Huanyu Liu,
Kaibo Liu, Lecheng Wang, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming
Zhang, Yihong Dong, Yuqi Zhu, Bin Gu, Mengfei Yang
- Abstract summary: We propose a new benchmark named DevEval, aligned with Developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
- Score: 52.16841274646796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to evaluate Large Language Models (LLMs) in code generation is an open
question. Many benchmarks have been proposed but are inconsistent with
practical software projects, e.g., unreal program distributions, insufficient
dependencies, and small-scale project contexts. Thus, the capabilities of LLMs
in practical projects are still unclear. In this paper, we propose a new
benchmark named DevEval, aligned with Developers' experiences in practical
projects. DevEval is collected through a rigorous pipeline, containing 2,690
samples from 119 practical projects and covering 10 domains. Compared to
previous benchmarks, DevEval aligns to practical projects in multiple
dimensions, e.g., real program distributions, sufficient dependencies, and
enough-scale project contexts. We assess five popular LLMs on DevEval (e.g.,
gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual
abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo
is only 42 in our experiments. We also discuss the challenges and future
directions of code generation in practical projects. We open-source DevEval and
hope it can facilitate the development of code generation in practical
projects.
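The abstract reports Pass@1 but does not say how it is computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator commonly used for code-generation benchmarks, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a task and c is the number that pass all tests. That DevEval uses exactly this estimator is an assumption, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: samples generated for the task
    c: samples that passed all tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; results holds one (n, c) pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical numbers: three tasks, ten samples each.
    demo = [(10, 3), (10, 0), (10, 10)]
    print(benchmark_pass_at_k(demo, k=1))
```

With k = 1 the estimator reduces to c / n per task, so the benchmark score is the mean fraction of passing samples across tasks.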
Related papers
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
We introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.
We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE).
Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and VersiCode.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z)
- CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios [25.085449990951034]
We introduce CoderUJB, a new benchmark designed to evaluate large language models (LLMs) across diverse Java programming tasks.
Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs.
The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation.
arXiv Detail & Related papers (2024-03-28T10:19:18Z)
- DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge.
It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation [2.93322471069531]
We conduct an empirical analysis of conversations in DevGPT, a dataset collected from developers' conversations with ChatGPT.
Our findings indicate that the current practice of using LLM-generated code is typically limited to either demonstrating high-level concepts or providing examples in documentation.
arXiv Detail & Related papers (2024-02-18T20:48:09Z)
- Learning code summarization from a small and local dataset [0.0]
Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, and training a model specifically designed to be sample efficient.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
arXiv Detail & Related papers (2022-06-02T00:16:03Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)