Related papers: A System Model Generation Benchmark from Natural Language Requirements

A System Model Generation Benchmark from Natural Language Requirements

URL: http://arxiv.org/abs/2508.03215v1
Date: Tue, 05 Aug 2025 08:45:19 GMT
Title: A System Model Generation Benchmark from Natural Language Requirements
Authors: Dongming Jin, Zhi Jin, Linyu Li, Zheng Fang, Jia Li, Xiaohong Chen,
Abstract summary: We present SysMBench, which comprises 151 human-curated scenarios spanning a wide range of popular domains.<n>Each scenario mainly comprises a natural language requirements description, a system model expressed in a specific model description language, and a visualized system model diagram.<n>We introduce SysMEval, a semantic-aware evaluation metric to evaluate the quality of generated system models.
Score: 29.820716110232553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: System models, a critical artifact in software development, provide a formal abstraction of both the structural and behavioral aspects of software systems, which can facilitate the early requirements analysis and architecture design. However, developing system models remains challenging due to the specific syntax of model description languages and the relative scarcity of public model examples. While large language models (LLMs) have shown promise in generating code with programming languages and could potentially aid in system model development, no benchmarks currently exist for evaluating their ability to generate system models with specific description languages. We present SysMBench, which comprises 151 human-curated scenarios spanning a wide range of popular domains and varying difficulty levels. Each scenario mainly comprises a natural language requirements description, a system model expressed in a specific model description language, and a visualized system model diagram. The requirements description is fed as user input to the LLM, the system model with description language is used to verify if the generated system model conforms to the requirements, and the visualized diagram serves to support manual validation. We introduce SysMEval, a semantic-aware evaluation metric to evaluate the quality of generated system models. We evaluate 17 popular LLMs on this task with three traditional metrics and SysMEval, from directly prompting to three commonly used enhancement strategies. Our in-depth evaluation shows that LLMs perform poorly on SysMBench, with the highest BLEU of 4% and SysMEval-F1 of 62%. We release the SysMBench and its evaluation framework to enable future research on LLM-based system model generation.

Related papers

Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams [0.0]
This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge.<n>It uses system modeling language diagrams (SysML) to extract accurate information about the dependencies, attributes, and operations of components.
arXiv Detail & Related papers (2025-07-09T12:44:49Z)
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
Learnware of Language Models: Specialized Small Language Models Can Do Big [50.285859986475394]
This paper presents a preliminary attempt to apply the learnware paradigm to language models.<n>We simulated a learnware system comprising approximately 100 learnwares of specialized SLMs with 8B parameters.<n>By selecting one suitable learnware for each task-specific inference, the system outperforms the base SLMs on all benchmarks.
arXiv Detail & Related papers (2025-05-19T17:54:35Z)
LLM-enabled Instance Model Generation [4.52634430160579]
This work explores the generation of instance models using large language models (LLMs)<n>We propose a two-step approach: first, using LLMs to produce a simplified structured output containing all necessary instance model information, and then compiling this intermediate representation into a valid XMI file.<n>Results show that the proposed method significantly improves the usability of LLMs for instance model generation tasks.
arXiv Detail & Related papers (2025-03-28T16:34:29Z)
SysBench: Can Large Language Models Follow System Messages? [30.701602680394686]
Large Language Models (LLMs) have become instrumental across various applications, with the customization of these models to specific scenarios becoming increasingly critical. Despite the recognized potential of system messages to optimize AI-driven solutions, there is a notable absence of a benchmark for evaluating how well LLMs follow system messages. We introduce SysBench, a benchmark that systematically analyzes system message following ability in terms of three limitations of existing LLMs.
arXiv Detail & Related papers (2024-08-20T15:33:16Z)
Model Generation with LLMs: From Requirements to UML Sequence Diagrams [9.114284818139069]
This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., sequence diagrams, from NL requirements. We examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains. Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges.
arXiv Detail & Related papers (2024-04-09T15:07:25Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
Prompt2Model: Generating Deployable Models from Natural Language Instructions [74.19816829003729]
Large language models (LLMs) enable system builders to create competent NLP systems through prompting. In other ways, LLMs are a step backward from traditional special-purpose NLP models. We propose Prompt2Model, a general-purpose method that takes a natural language task description like the prompts provided to LLMs.
arXiv Detail & Related papers (2023-08-23T17:28:21Z)
Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.<n> Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
From Natural Language to Simulations: Applying GPT-3 Codex to Automate Simulation Modeling of Logistics Systems [0.0]
This work is the first attempt to apply Natural Language Processing to automate the development of simulation models of systems vitally important for logistics. We demonstrated that the framework built on top of the fine-tuned GPT-3 Codex, a Transformer-based language model, could produce functionally valid simulations of queuing and inventory control systems given the verbal description.
arXiv Detail & Related papers (2022-02-24T14:01:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.