DevEval: Evaluating Code Generation in Practical Software Projects
- URL: http://arxiv.org/abs/2401.06401v4
- Date: Wed, 6 Mar 2024 02:16:51 GMT
- Title: DevEval: Evaluating Code Generation in Practical Software Projects
- Authors: Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Zhi Jin, Hao Zhu, Huanyu Liu,
Kaibo Liu, Lecheng Wang, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming
Zhang, Yihong Dong, Yuqi Zhu, Bin Gu, Mengfei Yang
- Abstract summary: We propose a new benchmark named DevEval, aligned with developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
- Score: 52.16841274646796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to evaluate Large Language Models (LLMs) in code generation is an open
question. Many benchmarks have been proposed but are inconsistent with
practical software projects, e.g., unrealistic program distributions, insufficient
dependencies, and small-scale project contexts. Thus, the capabilities of LLMs
in practical projects are still unclear. In this paper, we propose a new
benchmark named DevEval, aligned with developers' experiences in practical
projects. DevEval is collected through a rigorous pipeline, containing 2,690
samples from 119 practical projects and covering 10 domains. Compared to
previous benchmarks, DevEval aligns with practical projects in multiple
dimensions, e.g., realistic program distributions, sufficient dependencies, and
sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g.,
gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual
abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo
is only 42 in our experiments. We also discuss the challenges and future
directions of code generation in practical projects. We open-source DevEval and
hope it can facilitate the development of code generation in practical
projects.
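The Pass@1 figure above refers to the pass@k functional-correctness metric. As a reference point only, below is a minimal sketch of the commonly used unbiased pass@k estimator (from the Codex/HumanEval line of work); the function name and the example numbers are illustrative assumptions, and DevEval's actual evaluation harness may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem (illustrative sketch).

    n: total samples generated for the problem
    c: number of samples that pass all tests
    k: number of samples considered
    """
    # If every subset of size k must contain a correct sample, pass@k is 1.
    if n - c < k:
        return 1.0
    # Probability that at least one of k drawn samples is correct.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations, 4 correct -> pass@1 = 0.4
print(pass_at_k(n=10, c=4, k=1))
```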
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z)
- CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios [25.085449990951034]
We introduce CoderUJB, a new benchmark designed to evaluate large language models (LLMs) across diverse Java programming tasks.
Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs.
The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation.
arXiv Detail & Related papers (2024-03-28T10:19:18Z)
- SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models using multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)
- DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
- Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation [2.93322471069531]
We conduct an empirical analysis of conversations in DevGPT, a dataset collected from developers' conversations with ChatGPT.
Our findings indicate that the current practice of using LLM-generated code is typically limited to either demonstrating high-level concepts or providing examples in documentation.
arXiv Detail & Related papers (2024-02-18T20:48:09Z)
- Learning code summarization from a small and local dataset [0.0]
Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, and training a model specifically designed to be sample-efficient.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
arXiv Detail & Related papers (2022-06-02T00:16:03Z)