Collaborative LLM Agents for C4 Software Architecture Design Automation
- URL: http://arxiv.org/abs/2510.22787v1
- Date: Sun, 26 Oct 2025 18:43:59 GMT
- Title: Collaborative LLM Agents for C4 Software Architecture Design Automation
- Authors: Kamil Szczepanik, Jarosław A. Chudziak
- Abstract summary: This study contributes to automated software architecture design and its evaluation methods. We introduce an LLM-based multi-agent system that automates the production of a C4 software architecture model. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software architecture design is a fundamental part of creating every software system. Despite its importance, producing a C4 software architecture model, the preferred notation for such architecture, remains manual and time-consuming. We introduce an LLM-based multi-agent system that automates this task by simulating a dialogue between role-specific experts who analyze requirements and generate the Context, Container, and Component views of the C4 model. Quality is assessed with a hybrid evaluation framework: deterministic checks for structural and syntactic integrity and C4 rule consistency, plus semantic and qualitative scoring via an LLM-as-a-Judge approach. Tested on five canonical system briefs, the workflow demonstrates fast C4 model creation, sustains high compilation success, and delivers semantic fidelity. A comparison of four state-of-the-art LLMs shows different strengths relevant to architectural design. This study contributes to automated software architecture design and its evaluation methods.
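The abstract mentions deterministic checks for structural integrity and C4 rule consistency but does not give their implementation. The following is only a minimal sketch of what such a check could look like, not the authors' method. The `C4Model` class, the `PARENT_KIND` table, and the `check_model` function are hypothetical names, and the nesting rule (containers sit inside systems, components inside containers) is a simplified reading of C4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class C4Model:
    """Hypothetical in-memory C4 model (not the paper's data structure)."""
    # element id -> (kind, parent id or None)
    elements: Dict[str, Tuple[str, Optional[str]]] = field(default_factory=dict)
    # (source id, destination id) pairs
    relationships: List[Tuple[str, str]] = field(default_factory=list)


# Assumed nesting rule: the kind of parent each element kind requires.
PARENT_KIND = {
    "person": None,
    "system": None,
    "container": "system",
    "component": "container",
}


def check_model(model: C4Model) -> List[str]:
    """Return a list of rule violations; an empty list means the model passed."""
    errors: List[str] = []
    for eid, (kind, parent) in model.elements.items():
        if kind not in PARENT_KIND:
            errors.append(f"{eid}: unknown kind '{kind}'")
            continue
        expected = PARENT_KIND[kind]
        if expected is None:
            if parent is not None:
                errors.append(f"{eid}: a {kind} must not have a parent")
        elif parent not in model.elements:
            errors.append(f"{eid}: parent is missing or undefined")
        elif model.elements[parent][0] != expected:
            errors.append(f"{eid}: parent must be a {expected}")
    # Every relationship endpoint must refer to a declared element.
    for src, dst in model.relationships:
        for end in (src, dst):
            if end not in model.elements:
                errors.append(f"relationship {src}->{dst}: undefined element '{end}'")
    return errors
```

Because a check of this kind is deterministic, the same model always yields the same list of violations, which is what makes it a useful complement to LLM-as-a-Judge scoring.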
Related papers
- Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas [0.0]
This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools. The key metric is Cost-of-Pass (CoP), which directly quantifies the trade-off between complexity and efficiency.
arXiv Detail & Related papers (2026-02-09T15:06:24Z)
- Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production [0.0]
Large language models (LLMs) have demonstrated strong capabilities in open-ended reasoning and generative language tasks. For structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone. We show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance.
arXiv Detail & Related papers (2026-02-06T03:54:28Z)
- Evaluating Classical Software Process Models as Coordination Mechanisms for LLM-Based Software Generation [4.583390874772685]
This study explores how traditional software development processes can be adapted as coordination scaffolds for Large Language Model (LLM)-based MAS. We executed 11 diverse software projects under three process models and four GPT variants, totaling 132 runs. Both process model and LLM choice significantly affected system performance. Waterfall was most efficient, V-Model produced the most verbose code, and Agile achieved the highest code quality.
arXiv Detail & Related papers (2025-09-17T13:11:49Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
- LLM4CMO: Large Language Model-aided Algorithm Design for Constrained Multiobjective Optimization [54.35609820607923]
Large language models (LLMs) offer new opportunities for assisting with algorithm design. We propose LLM4CMO, a novel CMOEA based on a dual-population, two-stage framework. LLMs can serve as efficient co-designers in the development of complex evolutionary optimization algorithms.
arXiv Detail & Related papers (2025-08-16T02:00:57Z)
- MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration [20.14573932063689]
We propose MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints. Our results show that MAAD's superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports.
arXiv Detail & Related papers (2025-07-28T23:18:25Z)
- Bench4KE: Benchmarking Automated Competency Question Generation [1.2512982702508668]
Bench4KE is an API-based benchmarking system for Knowledge Engineering automation. It provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated.
arXiv Detail & Related papers (2025-05-30T13:03:42Z)
- SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
SCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components: TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; A PC$2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach achieves significantly higher accuracy compared to classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
- A quantitative framework for evaluating architectural patterns in ML systems [49.1574468325115]
This study proposes a framework for quantitative assessment of architectural patterns in ML systems. We focus on scalability and performance metrics for cost-effective CPU-based inference.
arXiv Detail & Related papers (2025-01-20T15:30:09Z)
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- From Requirements to Architecture: An AI-Based Journey to Semi-Automatically Generate Software Architectures [2.4150871564195007]
We propose a method to generate software architecture candidates based on requirements using artificial intelligence techniques.
We further envision an automatic evaluation and trade-off analysis of the generated architecture candidates.
arXiv Detail & Related papers (2024-01-25T10:56:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.