Related papers: Collaborative LLM Agents for C4 Software Architecture Design Automation

URL: http://arxiv.org/abs/2510.22787v1
Date: Sun, 26 Oct 2025 18:43:59 GMT
Title: Collaborative LLM Agents for C4 Software Architecture Design Automation
Authors: Kamil Szczepanik, Jarosław A. Chudziak
Abstract summary: This study contributes to automated software architecture design and its evaluation methods. We introduce an LLM-based multi-agent system that automates the production of a C4 software architecture model. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software architecture design is a fundamental part of creating every software system. Despite its importance, producing a C4 software architecture model, the preferred notation for such architecture, remains manual and time-consuming. We introduce an LLM-based multi-agent system that automates this task by simulating a dialogue between role-specific experts who analyze requirements and generate the Context, Container, and Component views of the C4 model. Quality is assessed with a hybrid evaluation framework: deterministic checks for structural and syntactic integrity and C4 rule consistency, plus semantic and qualitative scoring via an LLM-as-a-Judge approach. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity. A comparison of four state-of-the-art LLMs shows different strengths relevant to architectural design. This study contributes to automated software architecture design and its evaluation methods.
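The deterministic half of the hybrid evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's actual checks or data structures, so everything below is an assumption made for illustration: the `Element` and `Model` classes, the `check_model` function, and the two rules chosen (containers must sit inside a software system and components inside a container, and every relationship must reference a defined element).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical, simplified C4 model. The paper's real representation
# and rule set are not given in the abstract.
@dataclass
class Element:
    name: str
    kind: str                      # "person", "system", "container", or "component"
    parent: Optional[str] = None   # name of the enclosing element, if any

@dataclass
class Model:
    elements: List[Element] = field(default_factory=list)
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (source, target)

# Kind of parent each nested element kind must have in the C4 hierarchy.
REQUIRED_PARENT = {"container": "system", "component": "container"}

def check_model(model: Model) -> List[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    errors: List[str] = []
    by_name = {}
    for e in model.elements:
        if e.name in by_name:
            errors.append(f"duplicate element name: {e.name}")
        by_name[e.name] = e
    # Hierarchy rule: containers live in systems, components in containers.
    for e in model.elements:
        required = REQUIRED_PARENT.get(e.kind)
        if required is None:
            continue
        parent = by_name.get(e.parent) if e.parent else None
        if parent is None or parent.kind != required:
            errors.append(f"{e.kind} '{e.name}' must sit inside a {required}")
    # Reference rule: both ends of every relationship must be defined.
    for src, dst in model.relations:
        for end in (src, dst):
            if end not in by_name:
                errors.append(f"relation references unknown element: {end}")
    return errors

example = Model(
    elements=[
        Element("Customer", "person"),
        Element("Shop", "system"),
        Element("Web App", "container", parent="Shop"),
        Element("Cart Service", "component", parent="Shop"),  # parent is a system, not a container
    ],
    relations=[("Customer", "Web App"), ("Web App", "Database")],  # "Database" is undefined
)

for problem in check_model(example):
    print(problem)
```

Checks of this kind are cheap and reproducible, which is presumably why the abstract pairs them with LLM-as-a-Judge scoring for the semantic and qualitative aspects that fixed rules cannot capture.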

Related papers

Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas [0.0]
This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools. The key metric is Cost-of-Pass (CoP), which directly quantifies the trade-off between complexity and efficiency.
arXiv Detail & Related papers (2026-02-09T15:06:24Z)
Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production [0.0]
Large language models (LLMs) have demonstrated strong capabilities in open-ended reasoning and generative language tasks. For structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone. We show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance.
arXiv Detail & Related papers (2026-02-06T03:54:28Z)
Evaluating Classical Software Process Models as Coordination Mechanisms for LLM-Based Software Generation [4.583390874772685]
This study explores how traditional software development processes can be adapted as coordination scaffolds for Large Language Model (LLM)-based MAS. We executed 11 diverse software projects under three process models and four GPT variants, totaling 132 runs. Both process model and LLM choice significantly affected system performance. Waterfall was most efficient, V-Model produced the most verbose code, and Agile achieved the highest code quality.
arXiv Detail & Related papers (2025-09-17T13:11:49Z)
Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design. We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework. LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z)
MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration [20.14573932063689]
We propose MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports.
arXiv Detail & Related papers (2025-07-28T23:18:25Z)
Bench4KE: Benchmarking Automated Competency Question Generation [1.2512982702508668]
Bench4KE is an API-based benchmarking system for Knowledge Engineering automation. It provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated.
arXiv Detail & Related papers (2025-05-30T13:03:42Z)
SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
SCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components, including TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; and a PC$^2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy than the classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
A quantitative framework for evaluating architectural patterns in ML systems [49.1574468325115]
This study proposes a framework for quantitative assessment of architectural patterns in ML systems. We focus on scalability and performance metrics for cost-effective CPU-based inference.
arXiv Detail & Related papers (2025-01-20T15:30:09Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
From Requirements to Architecture: An AI-Based Journey to Semi-Automatically Generate Software Architectures [2.4150871564195007]
We propose a method to generate software architecture candidates based on requirements using artificial intelligence techniques. We further envision an automatic evaluation and trade-off analysis of the generated architecture candidates.
arXiv Detail & Related papers (2024-01-25T10:56:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.