Related papers: EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

URL: http://arxiv.org/abs/2404.00599v1
Date: Sun, 31 Mar 2024 08:10:50 GMT
Title: EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
Authors: Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin
Abstract summary: Existing benchmarks demonstrate poor alignment with real-world code repositories. EvoCodeBench is an evolving benchmark to avoid data leakage. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular Large Language Models.
Score: 42.257427142180546
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address the preceding problems. It has three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies), and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark to avoid data leakage. We build an automatic pipeline to update EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 is only 20.73% in our experiments. We also analyze failed cases and summarize the shortcomings of existing LLMs in EvoCodeBench. We release EvoCodeBench, all prompts, and LLMs' completions for further community analysis.
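
The abstract names Pass@k and Recall@k without defining them. The sketch below is not taken from the paper. It uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for a task with n generated programs of which c pass the tests. Recall@k is written under an assumption: that it is the best recall of the reference dependencies among k generated programs. The function names and the per-sample counts in the usage note are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task: n programs sampled, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n programs has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over tasks; each entry is (n, c) for one task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


def recall_at_k(reference_deps: set[str], generated_deps: list[set[str]]) -> float:
    """ASSUMED definition: highest dependency recall among the k generated programs."""
    if not reference_deps:
        raise ValueError("reference_deps must be non-empty")
    return max(len(reference_deps & deps) / len(reference_deps) for deps in generated_deps)


if __name__ == "__main__":
    # Hypothetical counts: one program per task, 57 of 275 tasks solved.
    results = [(1, 1)] * 57 + [(1, 0)] * 218
    print(f"Pass@1 = {100 * mean_pass_at_k(results, 1):.2f}%")
    print(recall_at_k({"a", "b", "c", "d"}, [{"a"}, {"b", "c"}]))
```

With one program per task, Pass@1 reduces to the fraction of tasks solved. The hypothetical 57 of 275 is chosen only because it rounds to the 20.73% quoted above; the page does not report per-sample counts.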

Related papers

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations [87.34429475432998]
Existing benchmarks have two limitations: data leakage and a lack of domain-specific evaluation. EvoCodeBench will be dynamically updated periodically (e.g., every 6 months) to avoid data leakage. This paper releases the first version, EvoCodeBench-2403, containing 275 samples from 25 repositories.
arXiv Detail & Related papers (2024-10-30T08:57:59Z)
Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet' [9.48622608877252]
A number of repository-level code generation benchmarks have emerged to evaluate the capabilities of large language models (LLMs). These benchmarks consist of short completions, synthetic examples, or focus on limited scale repositories, failing to represent real-world coding tasks. We create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM). We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories. We propose a new benchmark named DevEval, which has three advances. DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale freeform question-answering (QA) benchmark for code. It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. We conduct a systematic evaluation of over 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
Repository-Level Prompt Generation for Large Language Models of Code [28.98699307030983]
We propose a framework that learns to generate example-specific prompts using prompt proposals. The prompt proposals take context from the entire repository. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives.
arXiv Detail & Related papers (2022-06-26T10:51:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
