Related papers: Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

URL: http://arxiv.org/abs/2403.08604v3
Date: Sat, 14 Dec 2024 09:45:51 GMT
Title: Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
Authors: Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen,
Abstract summary: We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval.<n>DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task.<n> Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
Score: 72.24266814625685
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, encompassing stages including software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.

Related papers

Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection [15.026084450436976]
We present a study evaluating the performance of large language models (LLMs) on the software vulnerability detection task. We have compiled a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools.
arXiv Detail & Related papers (2025-03-03T11:56:00Z)
Large Language Models for Code Generation: The Practitioners Perspective [4.946128083535776]
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. We propose and develop a multi-model unified platform to generate and execute code based on natural language prompts. We conducted a survey with 60 software practitioners from 11 countries across four continents to evaluate the usability, performance, strengths, and limitations of each model.
arXiv Detail & Related papers (2025-01-28T14:52:16Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [29.178248778212588]
ComplexCodeEval is a benchmark designed to assess large language models (LLMs) in various development tasks. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
arXiv Detail & Related papers (2024-09-16T13:43:04Z)
Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs. We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines. We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development. We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM) We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks. This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models [49.387195629660994]
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks.
arXiv Detail & Related papers (2024-04-04T15:49:49Z)
CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios [25.085449990951034]
We introduce CoderUJB, a new benchmark designed to evaluate large language models (LLMs) across diverse Java programming tasks. Our empirical study on this benchmark investigates the coding abilities of various open-source and closed-source LLMs. The findings indicate that while LLMs exhibit strong potential, challenges remain, particularly in non-functional code generation.
arXiv Detail & Related papers (2024-03-28T10:19:18Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge. It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming [12.355284125578342]
Large Language Models (LLMs) have become a focal point in modern software development. LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, each system requires the LLM to be honed to its set of workspaces to ensure the best performance.
arXiv Detail & Related papers (2024-02-22T03:51:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.