Related papers: Validating Formal Specifications with LLM-generated Test Cases

Validating Formal Specifications with LLM-generated Test Cases

URL: http://arxiv.org/abs/2510.23350v1
Date: Mon, 27 Oct 2025 14:02:20 GMT
Title: Validating Formal Specifications with LLM-generated Test Cases
Authors: Alcino Cunha, Nuno Macedo,
Abstract summary: This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements.<n>Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported.
Score: 1.2031796234206136
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct and that satisfy (or not) the given requirement, and that can detect many wrong specifications written by humans.

Related papers

LLMCFG-TGen: Using LLM-Generated Control Flow Graphs to Automatically Create Test Cases from Use Cases [11.173694789846435]
Appropriate test case generation is critical in software testing.<n>Use-case descriptions are a popular method for capturing functional behaviors and interaction flows in a structured form.<n>We propose a new approach that automatically generates test cases from NL use-case descriptions.
arXiv Detail & Related papers (2025-12-06T11:19:37Z)
Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements.<n>Existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency.<n>We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
AsserT5: Test Assertion Generation Using a Fine-Tuned Code Language Model [8.995812770349602]
We propose AsserT5, a new model based on the pre-trained CodeT5 model.<n>We find that the abstraction and the inclusion of the focal method are useful also for a fine-tuned pre-trained model.
arXiv Detail & Related papers (2025-02-04T20:42:22Z)
System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT [1.9282110216621835]
This paper explores the effectiveness of using Large Language Models (LLMs) to generate test case designs from Software Requirements Specification (SRS) documents.<n>About 87 percent of the generated test cases were valid, with the remaining 13 percent either not applicable or redundant.
arXiv Detail & Related papers (2024-12-04T20:12:27Z)
Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy [0.5057850174013127]
Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code.<n>This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs.<n>Experiments show that VALTEST boosts test validity by up to 29% and improves code generation performance, as evidenced by significant increases in pass@1 scores.
arXiv Detail & Related papers (2024-11-13T00:07:32Z)
Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data. We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch. Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings. We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications. Most testing procedures still rely on test cases written by humans to form test suites. We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action. We observe that the proposed modules resulted in improved, and in fact state-of-the-art performance on an unseen validation set of a popular benchmark dataset, ALFRED. We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.