Evaluating Software Process Models for Multi-Agent Class-Level Code Generation
- URL: http://arxiv.org/abs/2511.09794v1
- Date: Fri, 14 Nov 2025 01:10:06 GMT
- Title: Evaluating Software Process Models for Multi-Agent Class-Level Code Generation
- Authors: Wasique Islam Shafin, Md Nakhla Rafi, Zhenhao Li, Tse-Hsun Chen
- Abstract summary: Large Language Models (LLMs) are increasingly used to automate software development. This work examines how process structure and role specialization shape multi-agent LLM workflows for class-level code generation.
- Score: 5.545076518491288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern software systems require code that is not only functional but also maintainable and well-structured. Although Large Language Models (LLMs) are increasingly used to automate software development, most studies focus on isolated, single-agent function-level generation. This work examines how process structure and role specialization shape multi-agent LLM workflows for class-level code generation. We simulate a Waterfall-style development cycle covering Requirement, Design, Implementation, and Testing using three LLMs (GPT-4o-mini, DeepSeek-Chat, and Claude-3.5-Haiku) on 100 Python tasks from the ClassEval benchmark. Our findings show that multi-agent workflows reorganize, rather than consistently enhance, model performance. Waterfall-style collaboration produces cleaner and more maintainable code but often reduces functional correctness (-37.8% for GPT-4o-mini and -39.8% for DeepSeek-Chat), with Claude-3.5-Haiku as a notable exception (+9.5%). Importantly, process constraints shift failure characteristics: structural issues such as missing code decrease, while semantic and validation errors become more frequent. Among all stages, Testing exerts the strongest influence by improving verification coverage but also introducing new reasoning failures, whereas Requirement and Design have comparatively modest effects. Overall, this study provides empirical evidence that software process structure fundamentally alters how LLMs reason, collaborate, and fail, revealing inherent trade-offs between rigid workflow discipline and flexible problem-solving in multi-agent code generation.
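The staged pipeline described in the abstract can be sketched as a minimal multi-agent loop in which each stage agent receives the task plus all upstream artifacts. This is a hypothetical illustration, not the paper's actual harness: the `llm` callable, prompt wording, and `Artifact` type are assumptions; a real run would call GPT-4o-mini, DeepSeek-Chat, or Claude-3.5-Haiku through their respective APIs.

```python
from dataclasses import dataclass

# Hypothetical Waterfall-style multi-agent sketch: one agent per stage,
# each consuming the task and every artifact produced upstream.
STAGES = ["Requirement", "Design", "Implementation", "Testing"]

@dataclass
class Artifact:
    stage: str
    content: str

def run_waterfall(task: str, llm) -> list[Artifact]:
    """Run each stage agent in sequence, feeding prior artifacts forward."""
    artifacts: list[Artifact] = []
    for stage in STAGES:
        context = "\n\n".join(f"[{a.stage}]\n{a.content}" for a in artifacts)
        prompt = (
            f"You are the {stage} agent in a Waterfall process.\n"
            f"Task: {task}\n"
            f"Upstream artifacts:\n{context or '(none)'}"
        )
        artifacts.append(Artifact(stage, llm(prompt)))
    return artifacts

# Usage with a stub model standing in for a real LLM API:
stub = lambda prompt: f"output for: {prompt.splitlines()[0]}"
result = run_waterfall("Implement a ClassEval-style Python class", stub)
print([a.stage for a in result])
# → ['Requirement', 'Design', 'Implementation', 'Testing']
```

The key design point mirrored from the abstract is that information flows strictly forward: a downstream agent (e.g. Testing) sees all upstream artifacts, but no stage revisits an earlier one, which is the rigid-workflow discipline the paper identifies as the source of both cleaner structure and new semantic failure modes.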
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z) - From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence [150.3696990310269]
Large language models (LLMs) have transformed automated software development by enabling direct translation of natural language descriptions into functional code. We provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) about code LLMs. We analyze the code capabilities of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder).
arXiv Detail & Related papers (2025-11-23T17:09:34Z) - Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs [12.70863561286374]
We introduce a lifecycle-aware framework that incorporates intermediate artifacts into both the training and inference stages. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning. Open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code.
arXiv Detail & Related papers (2025-10-28T02:54:02Z) - Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach [13.760107452858044]
This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agent systems. In code development, it achieves up to a 69.6% reduction in total failures for function-level development and 56.7% for deployment-level development.
arXiv Detail & Related papers (2025-10-14T03:49:30Z) - Benchmarking Correctness and Security in Multi-Turn Code Generation [41.75392001830794]
We introduce MT-Sec, the first benchmark to evaluate correctness and security in multi-turn coding scenarios. We evaluate 32 open- and closed-source models and three agent scaffoldings on MT-Sec. We find that while agent scaffoldings boost single-turn code generation performance, they are not quite as effective in multi-turn evaluations.
arXiv Detail & Related papers (2025-10-13T01:20:46Z) - Evaluating Classical Software Process Models as Coordination Mechanisms for LLM-Based Software Generation [4.583390874772685]
This study explores how traditional software development processes can be adapted as coordination scaffolds for Large Language Model (LLM)-based MAS. We executed 11 diverse software projects under three process models and four GPT variants, totaling 132 runs. Both process model and LLM choice significantly affected system performance. Waterfall was most efficient, V-Model produced the most verbose code, and Agile achieved the highest code quality.
arXiv Detail & Related papers (2025-09-17T13:11:49Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root causes of common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.