Extracting Formal Specifications from Documents Using LLMs for Automated Testing
- URL: http://arxiv.org/abs/2504.01294v1
- Date: Wed, 02 Apr 2025 01:58:11 GMT
- Title: Extracting Formal Specifications from Documents Using LLMs for Automated Testing
- Authors: Hui Li, Zhen Dong, Siao Wang, Hui Zhang, Liwei Shen, Xin Peng, Dongdong She
- Abstract summary: The main approach to defining formal specifications is manual analysis of software documents. System updates further increase the human labor required to maintain the corresponding formal specifications. Recent advances in Large Language Models have demonstrated promising capabilities in natural language understanding.
- Score: 11.129512305353055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated testing plays a crucial role in ensuring software security. It heavily relies on formal specifications to validate the correctness of the system behavior. However, the main approach to defining these formal specifications is manual analysis of software documents, which requires a significant amount of engineering effort from experienced researchers and engineers. Meanwhile, system updates further increase the human labor cost of maintaining a corresponding formal specification, making the manual analysis approach a time-consuming and error-prone task. Recent advances in Large Language Models (LLMs) have demonstrated promising capabilities in natural language understanding. Yet, the feasibility of using LLMs to automate the extraction of formal specifications from software documents remains unexplored. We conduct an empirical study by constructing a comprehensive dataset comprising 603 specifications from 37 documents across three representative open-source software projects. We then evaluate the capabilities of the most recent LLMs, including GPT-4o, Claude, and Llama, in extracting formal specifications from documents in an end-to-end fashion. Our study demonstrates the applicability of LLMs to formal specification extraction while identifying two major limitations: specification oversimplification and specification fabrication. We attribute these deficiencies to the LLMs' inherent limitations in processing and expressive capabilities, as well as their tendency to fabricate fictional information. Inspired by human cognitive processes, we propose a two-stage method, annotation-then-conversion, to address these challenges. Our method demonstrates significant improvements over the end-to-end method, with a 29.2% increase in the number of correctly extracted specifications and a 14.0% improvement in average accuracy. In particular, our best-performing LLM achieves an accuracy of 71.6%.
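The annotation-then-conversion method described in the abstract splits extraction into two LLM passes: first annotate the specification-bearing sentences in a document, then convert only those annotations into formal specifications. Below is a minimal sketch of how such a pipeline might be wired up; the prompt wording, the `call_llm` helper, and the plain-text output format are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-stage "annotation-then-conversion" pipeline.
# All prompts and helper names are illustrative assumptions.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API (e.g., GPT-4o, Claude, Llama)."""
    raise NotImplementedError("wire this to your LLM provider")

ANNOTATE_PROMPT = (
    "Read the following documentation excerpt and annotate every sentence that "
    "states a testable behavioral constraint. Return one constraint per line.\n\n"
    "{document}"
)

CONVERT_PROMPT = (
    "Convert each annotated constraint below into a formal specification "
    "(e.g., a pre/post-condition). Do not invent constraints that are not listed.\n\n"
    "{annotations}"
)

def extract_specifications(document: str) -> list[str]:
    # Stage 1: annotation -- locate specification-bearing sentences.
    annotations = call_llm(ANNOTATE_PROMPT.format(document=document))
    # Stage 2: conversion -- translate the annotations into formal form.
    formal = call_llm(CONVERT_PROMPT.format(annotations=annotations))
    return [line.strip() for line in formal.splitlines() if line.strip()]
```

Constraining the second prompt to the annotated constraints is one plausible way to limit the fabrication and oversimplification failure modes the paper reports for end-to-end extraction.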
Related papers
- Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators. It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators both to validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z)
- Automatic High-quality Verilog Assertion Generation through Subtask-Focused Fine-Tuned LLMs and Iterative Prompting [0.0]
We present a large language model (LLM)-based flow to automatically generate high-quality SystemVerilog Assertions (SVA).
We introduce a novel sub-task-focused fine-tuning approach, leading to a remarkable 7.3-fold increase in the number of functionally correct assertions.
Experiments demonstrate a 26% increase in the number of assertions free from syntax errors using this approach.
arXiv Detail & Related papers (2024-11-23T03:52:32Z)
- Exploring LLMs for Verifying Technical System Specifications Against Requirements [41.19948826527649]
The field of knowledge-based requirements engineering (KBRE) aims to support engineers by providing knowledge to assist in the elicitation, validation, and management of system requirements.
The advent of large language models (LLMs) opens new opportunities in the field of KBRE.
This work experimentally investigates the potential of LLMs in requirements verification.
arXiv Detail & Related papers (2024-11-18T13:59:29Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification [15.686651364655958]
AutoSpec is an automated approach to synthesize specifications for automated program verification.
It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof.
It can be successfully applied to verify the programs in a real-world X509-parser project.
arXiv Detail & Related papers (2024-03-31T18:15:49Z)
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- SpecGen: Automated Generation of Formal Program Specifications via Large Language Models [20.36964281778921]
SpecGen is a novel technique for formal program specification generation based on Large Language Models. We evaluate SpecGen on two datasets, including the SV-COMP 279 benchmark and a manually constructed dataset.
arXiv Detail & Related papers (2024-01-16T20:13:50Z)
- How Effective are Large Language Models in Generating Software Specifications? [14.170320751508502]
Large Language Models (LLMs) have been successfully applied to numerous Software Engineering (SE) tasks. We conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation.
arXiv Detail & Related papers (2023-06-06T00:28:39Z)
- Estimating Large Language Model Capabilities without Labeled Test Data [51.428562302037534]
Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples.
We propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task.
arXiv Detail & Related papers (2023-05-24T06:55:09Z)
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs). Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)