Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications
- URL: http://arxiv.org/abs/2506.10467v4
- Date: Sat, 19 Jul 2025 20:15:06 GMT
- Title: Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications
- Authors: Felix Härer,
- Abstract summary: LLMs can be used to solve complex tasks by combining reasoning techniques, code generation, and software execution across multiple, potentially specialized LLMs.<n>This paper introduces an agent schema language and the execution and evaluation of the specifications through a multi-agent system architecture and prototype.<n>Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in LLMs indicate potential for novel applications, as evidenced by the reasoning capabilities in the latest OpenAI and DeepSeek models. To apply these models to domain-specific applications beyond text generation, LLM-based multi-agent systems can be utilized to solve complex tasks, particularly by combining reasoning techniques, code generation, and software execution across multiple, potentially specialized LLMs. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application are not well understood. Defined specifications for multi-agent LLM systems are required to explore their potential and suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research on (1.) multi-agent specification by introducing an agent schema language and (2.) the execution and evaluation of the specifications through a multi-agent system architecture and prototype. The specification language, system architecture, and prototype are first presented in this work, building on an LLM system from prior research. Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach. As a result, evaluations could be demonstrated for question answering, server security, and network security tasks completed correctly by agents with LLMs from OpenAI and DeepSeek.
Related papers
- Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs)<n>We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines.<n>We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - Software Architecture Meets LLMs: A Systematic Literature Review [4.28281840272851]
We present a systematic literature review on the use of Large Language Models in software architecture.<n>Our findings show that while LLMs are increasingly applied to a variety of software architecture tasks, some areas, such as generating source code from architectural design, remain underexplored.
arXiv Detail & Related papers (2025-05-22T14:00:29Z) - JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation [3.6946337486060776]
JARVIS is a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for EDA tasks.<n>By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models.
arXiv Detail & Related papers (2025-05-20T23:40:57Z) - Exploring LLMs for Verifying Technical System Specifications Against Requirements [41.19948826527649]
The field of knowledge-based requirements engineering (KBRE) aims to support engineers by providing knowledge to assist in the elicitation, validation, and management of system requirements.
The advent of large language models (LLMs) opens new opportunities in the field of KBRE.
This work experimentally investigates the potential of LLMs in requirements verification.
arXiv Detail & Related papers (2024-11-18T13:59:29Z) - From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future [15.568939568441317]
We investigate the current practice and solutions for large language models (LLMs) and LLM-based agents for software engineering.<n>In particular we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance.<n>We discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering.
arXiv Detail & Related papers (2024-08-05T14:01:15Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - LLMmap: Fingerprinting For Large Language Models [15.726286532500971]
With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy.<n>We discuss potential mitigations and demonstrate that, against resourceful adversaries, effective countermeasures may be challenging or even unrealizable.
arXiv Detail & Related papers (2024-07-22T17:59:45Z) - DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark involves the recruitment of human annotators and the generation of synthetic questions.
It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions.
arXiv Detail & Related papers (2024-07-15T13:17:42Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
We propose a LLM-based Generative IoT (GIoT) system deployed in the local network setting in this study.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis [91.5632751731927]
Large Language Models such as ChatGPT have showcased remarkable abilities in solving general tasks.<n>We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders.<n>We analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results.
arXiv Detail & Related papers (2024-01-10T08:28:56Z) - The Tyranny of Possibilities in the Design of Task-Oriented LLM Systems:
A Scoping Survey [1.0489539392650928]
The paper begins by defining a minimal task-oriented LLM system and exploring the design space of such systems.
We discuss a pattern in our results and formulate them into three conjectures.
In all, the scoping survey presents seven conjectures that can help guide future research efforts.
arXiv Detail & Related papers (2023-12-29T13:35:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.