Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'
- URL: http://arxiv.org/abs/2410.21647v4
- Date: Tue, 24 Jun 2025 20:49:51 GMT
- Title: Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'
- Authors: Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan
- Abstract summary: A number of repository-level code generation benchmarks have emerged to evaluate the capabilities of large language models (LLMs). These benchmarks consist of short completions, synthetic examples, or focus on limited-scale repositories, failing to represent real-world coding tasks. We create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects.
- Score: 9.48622608877252
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like HumanEval and MBPP. A natural question, then, is whether LLMs perform as well on real-world coding tasks as they do on these benchmarks. Unfortunately, this question cannot be answered, because these benchmarks consist of short completions or synthetic examples, or focus on limited-scale repositories, and therefore fail to represent real-world coding tasks. To address these challenges, we create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects, together with appropriate metrics for evaluating source code. It includes 980 whole-function generation tasks from 11 popular projects, 50.8% of which require repository-level context. REPOCOD includes 314 developer-written test cases per instance for better evaluation. We evaluate ten LLMs on REPOCOD and find that none achieves more than 30% pass@1, indicating the necessity of building stronger LLMs that can help developers in real-world software development. In addition, we find that retrieval-augmented generation achieves better results than using target function dependencies as context.
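For reference, pass@1 on a benchmark such as REPOCOD is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021) and averaged over tasks. The sketch below, with made-up sample counts, illustrates that convention only; it is not REPOCOD's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated for a task, c: samples passing all tests, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative aggregation: benchmark-level pass@1 is the mean over tasks.
# The (n, c) pairs below are made up, not REPOCOD results.
results = [(10, 3), (10, 0), (10, 1)]
pass1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {pass1:.3f}")
```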
Related papers
- NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition [16.134058143793304]
This work introduces NoCode-bench, a benchmark designed to evaluate large language models (LLMs) on real-world NL-driven feature addition tasks. A subset of 114 high-quality, human-verified instances, NoCode-bench Verified, ensures reliable evaluation. Our experiments reveal that, despite high token usage, the best LLMs achieve a task success rate of only 15.79%, highlighting challenges in cross-file editing, understanding, and tool calling.
arXiv Detail & Related papers (2025-07-24T06:38:19Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code generation LLMs on coding tasks in Russian. The benchmark includes 11 evaluation tasks spanning 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - Turning the Tide: Repository-based Code Reflection [52.13709676656648]
We introduce LiveRepoReflection, a benchmark for evaluating code understanding and generation in multi-file repository contexts. It contains 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. We also create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources.
arXiv Detail & Related papers (2025-07-14T02:36:27Z) - Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models [8.70160958177614]
Program synthesis with Large Language Models (LLMs) suffers from a "near-miss syndrome".
We address this with a multi-agent framework called Synthesize, Execute, Instruct, Debug, and Repair (SEIDR).
We empirically explore these trade-offs by comparing replace-focused, repair-focused, and hybrid debug strategies.
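The replace-versus-repair trade-off that SEIDR studies can be pictured with a minimal loop; `generate` and `run_tests` below are hypothetical placeholders for an LLM call and a test harness, and the prompts are illustrative rather than taken from the paper.

```python
from typing import Callable, Tuple

def debug_loop(
    generate: Callable[[str], str],                # hypothetical LLM call: prompt -> program text
    run_tests: Callable[[str], Tuple[bool, str]],  # program -> (all_tests_passed, failure_log)
    task: str,
    strategy: str = "repair",                      # "repair" patches the candidate, "replace" regenerates
    max_rounds: int = 5,
) -> str:
    program = generate(f"Synthesize a solution for:\n{task}")
    for _ in range(max_rounds):
        passed, log = run_tests(program)
        if passed:
            return program
        if strategy == "replace":
            # Replace-focused: discard the near-miss and sample a fresh candidate.
            program = generate(f"Synthesize a solution for:\n{task}")
        else:
            # Repair-focused: ask the model to fix the existing candidate using test feedback.
            program = generate(
                f"This program fails its tests.\nProgram:\n{program}\n"
                f"Failures:\n{log}\nReturn a corrected program."
            )
    return program
```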
arXiv Detail & Related papers (2025-03-10T16:56:51Z) - FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [26.14778133391999]
FEA-Bench is a benchmark designed to assess the ability of large language models to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development.
arXiv Detail & Related papers (2025-03-09T16:11:57Z) - CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings [70.95565672516979]
Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments.
CodeElo is a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time.
arXiv Detail & Related papers (2025-01-02T13:49:00Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
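The abstract describes IIP as processing snippets line by line; a rough sketch of that idea, with a hypothetical `ask_llm` chat call and prompts that are illustrative rather than the paper's, could look like this:

```python
from typing import Callable

def execute_with_llm(ask_llm: Callable[[str], str], snippet: str) -> str:
    """Ask the model to simulate execution one line at a time, carrying state forward."""
    state = "{}"  # the model's textual description of the current variable state
    for i, line in enumerate(snippet.strip().splitlines(), start=1):
        prompt = (
            "You are simulating a Python interpreter.\n"
            f"Current variable state: {state}\n"
            f"Execute line {i}: {line}\n"
            "Reply with the updated variable state only."
        )
        state = ask_llm(prompt)
    return state  # the final state; printed output could be tracked the same way
```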
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [29.178248778212588]
ComplexCodeEval is a benchmark designed to assess large language models (LLMs) in various development tasks.
It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
arXiv Detail & Related papers (2024-09-16T13:43:04Z) - DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation [48.11754113512047]
This study introduces DOMAINEVAL, a code generation benchmark dataset encompassing six popular domains.
Our pipeline works in a fully automated manner, enabling push-button construction from code repositories into formatted subjects under study.
The contributions include the DOMAINEVAL dataset, the fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation based on their performance on DOMAINEVAL.
arXiv Detail & Related papers (2024-08-23T16:33:58Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark covers 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
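A self-critique loop of this general shape might be sketched as follows; `ask_llm` is a hypothetical placeholder, the bug categories in the prompt are examples rather than the paper's taxonomy, and only syntax-level compiler feedback is used here for brevity.

```python
from typing import Callable

def critique_and_fix(ask_llm: Callable[[str], str], task: str, rounds: int = 3) -> str:
    code = ask_llm(f"Write Python code for: {task}")
    for _ in range(rounds):
        try:
            compile(code, "<candidate>", "exec")     # cheap compiler feedback (syntax only)
            compiler_msg = "no syntax errors"
        except SyntaxError as exc:
            compiler_msg = f"SyntaxError: {exc}"
        critique = ask_llm(
            f"Task: {task}\nCode:\n{code}\nCompiler feedback: {compiler_msg}\n"
            "Name the most likely bug category (e.g., task misunderstanding, logic error, "
            "API misuse) and explain the flaw."
        )
        code = ask_llm(
            f"Task: {task}\nCode:\n{code}\nCritique:\n{critique}\n"
            "Return only a corrected version of the code."
        )
    return code
```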
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.
Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.
We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only the essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z) - On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation. We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
arXiv Detail & Related papers (2024-06-17T10:45:22Z) - Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository [4.767858874370881]
We introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating class-level code within real-world repositories.
RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories.
We introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context.
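As an illustration of the kind of repository tool such an agent might call, the toy helper below uses Python's `ast` module to collect method signatures of a class anywhere in a repository; it is an assumed example, not one of RRR's actual tools.

```python
import ast
from pathlib import Path

def list_class_signatures(repo_root: str, class_name: str) -> list[str]:
    """Return method signatures of `class_name` found anywhere under repo_root."""
    signatures = []
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and node.name == class_name:
                for item in node.body:
                    if isinstance(item, ast.FunctionDef):
                        args = ", ".join(a.arg for a in item.args.args)
                        signatures.append(f"{class_name}.{item.name}({args})")
    return signatures
```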
arXiv Detail & Related papers (2024-04-22T03:52:54Z) - InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is, to our knowledge, the first large-scale freeform question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z) - Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation [8.575560293086289]
Large language models (LLMs) have shown extraordinary ability in understanding natural language and generating programming code.
The misuse of APIs in the generated code could lead to severe problems, such as resource leaks and program crashes.
arXiv Detail & Related papers (2023-08-20T18:36:28Z) - Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation [20.45045253933097]
We propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.
EvalPlus augments a given evaluation dataset with a large number of test cases newly produced by an automatic test input generator.
We show that HumanEval+ is able to catch significant amounts of previously undetected wrong code.
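The augmentation idea can be pictured as differential testing: newly generated inputs are labeled by a trusted reference solution and then used to cross-check a candidate program. The toy example below uses random inputs and a sorting task purely for illustration; EvalPlus's actual input generator is more sophisticated.

```python
import random

def reference_sorted(xs):           # trusted ground-truth solution
    return sorted(xs)

def candidate_sorted(xs):           # stand-in for an LLM-synthesized program under test
    return sorted(xs, reverse=False)

def augment_and_check(n_cases: int = 1000) -> bool:
    """Generate extra inputs and cross-check the candidate against the reference."""
    for _ in range(n_cases):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if candidate_sorted(xs) != reference_sorted(xs):
            return False            # an augmented test caught a wrong program
    return True

print(augment_and_check())
```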
arXiv Detail & Related papers (2023-05-02T05:46:48Z) - DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation [70.96868419971756]
DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable): across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
arXiv Detail & Related papers (2022-11-18T17:20:27Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)