Related papers: Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach

Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach

URL: http://arxiv.org/abs/2510.12120v1
Date: Tue, 14 Oct 2025 03:49:30 GMT
Title: Towards Engineering Multi-Agent LLMs: A Protocol-Driven Approach
Authors: Zhenyu Mao, Jacky Keung, Fengji Zhang, Shuo Liu, Yifei Wang, Jialong Li,
Abstract summary: This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agents.<n>In code development, it achieves up to a 69.6% reduction in total failures function-level development and 56.7% for deployment-level development.
Score: 13.760107452858044
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing demand for software development has driven interest in automating software engineering (SE) tasks using Large Language Models (LLMs). Recent efforts extend LLMs into multi-agent systems (MAS) that emulate collaborative development workflows, but these systems often fail due to three core deficiencies: under-specification, coordination misalignment, and inappropriate verification, arising from the absence of foundational SE structuring principles. This paper introduces Software Engineering Multi-Agent Protocol (SEMAP), a protocol-layer methodology that instantiates three core SE design principles for multi-agent LLMs: (1) explicit behavioral contract modeling, (2) structured messaging, and (3) lifecycle-guided execution with verification, and is implemented atop Google's Agent-to-Agent (A2A) infrastructure. Empirical evaluation using the Multi-Agent System Failure Taxonomy (MAST) framework demonstrates that SEMAP effectively reduces failures across different SE tasks. In code development, it achieves up to a 69.6% reduction in total failures for function-level development and 56.7% for deployment-level development. For vulnerability detection, SEMAP reduces failure counts by up to 47.4% on Python tasks and 28.2% on C/C++ tasks.

Related papers

MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present textbfMagicAgent, a series of foundation models specifically designed for generalized agent planning.<n>We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks.<n>We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z)
Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
Evaluating Software Process Models for Multi-Agent Class-Level Code Generation [5.545076518491288]
Large Language Models (LLMs) are increasingly used to automate software development.<n>This work examines how process structure and role shape multi-agent specialization for class-level code generation.
arXiv Detail & Related papers (2025-11-12T22:53:12Z)
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? [71.21547572568655]
AgenTracer-8B is a lightweight failure tracer trained with multi-granular reinforcement learning.<n>On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%.<n>AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains.
arXiv Detail & Related papers (2025-09-03T13:42:14Z)
CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs [16.234259194402163]
We introduce CodeAgents, a prompting framework that codifies multi-agent reasoning and enables structured, token-efficient planning in multi-agent systems.<n>Results show consistent improvements in planning performance, with absolute gains of 3-36 percentage points over natural language prompting baselines.
arXiv Detail & Related papers (2025-07-04T02:20:19Z)
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems [50.29939179830491]
Failure attribution in LLM multi-agent systems remains underexplored and labor-intensive.<n>We develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons.<n>The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps.
arXiv Detail & Related papers (2025-04-30T23:09:44Z)
OmniNova:A General Multimodal Agent Framework [0.5439020425819]
Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems.<n>We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities.
arXiv Detail & Related papers (2025-03-25T19:21:01Z)
Planning with Multi-Constraints via Collaborative Language Agents [13.550774629515843]
This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative multi-agent systems.<n>PMC simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks.<n>PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperforming GPT-4 with ReAct on API-Bank by 13.64%.
arXiv Detail & Related papers (2024-05-26T10:33:17Z)
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language models (LLMs) agent and case-based reasoning (CBR) In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.