Related papers: DevBench: A Comprehensive Benchmark for Software Development

DevBench: A Comprehensive Benchmark for Software Development

URL: http://arxiv.org/abs/2403.08604v2
Date: Fri, 15 Mar 2024 13:23:34 GMT
Title: DevBench: A Comprehensive Benchmark for Software Development
Authors: Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen,
Abstract summary: DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
Score: 72.24266814625685
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of programming, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. To this end, we propose DevBench, a comprehensive benchmark that evaluates LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. DevBench features a wide range of programming languages and domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Analyses reveal that models struggle with understanding the complex structures in the repository, managing the compilation process, and grasping advanced programming concepts. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications. Our benchmark is available at https://github.com/open-compass/DevBench

Related papers

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code.<n>We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs.<n>We analyze the code capability of the general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder)
arXiv Detail & Related papers (2025-11-23T17:09:34Z)
RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models [49.83481415540291]
Large Language Models (LLMs) have exhibited significant proficiency in code debug.<n>This paper introduces Repo Debug, a multi-task and multi-language repository-level code debug dataset.<n>We conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnect, the best-performing model, still cannot perform well in repository-level debug.
arXiv Detail & Related papers (2025-09-04T10:13:21Z)
LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python [0.0]
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development.<n>This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python.
arXiv Detail & Related papers (2025-08-22T14:30:24Z)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks.<n>BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks.<n>Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection [15.026084450436976]
We present a study evaluating the performance of large language models (LLMs) on the software vulnerability detection task. We have compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools.
arXiv Detail & Related papers (2025-03-03T11:56:00Z)
Large Language Models for Code Generation: The Practitioners Perspective [4.946128083535776]
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. We propose and develop a multi-model unified platform to generate and execute code based on natural language prompts. We conducted a survey with 60 software practitioners from 11 countries across four continents to evaluate the usability, performance, strengths, and limitations of each model.
arXiv Detail & Related papers (2025-01-28T14:52:16Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [29.178248778212588]
ComplexCodeEval is a benchmark designed to assess large language models (LLMs) in various development tasks. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
arXiv Detail & Related papers (2024-09-16T13:43:04Z)
Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs. We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines. We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM) We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks. This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models [49.387195629660994]
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks.
arXiv Detail & Related papers (2024-04-04T15:49:49Z)
CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios [25.085449990951034]
We introduce CoderUJB, a new benchmark designed to evaluate large language models (LLMs) across diverse Java programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation.
arXiv Detail & Related papers (2024-03-28T10:19:18Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge. It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming [12.355284125578342]
Large Language Models (LLMs) have become a focal point in modern software development. LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, each system requires the LLM to be honed to its set of workspaces to ensure the best performance.
arXiv Detail & Related papers (2024-02-22T03:51:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.