Evaluating LLM-Generated ACSL Annotations for Formal Verification
- URL: http://arxiv.org/abs/2602.13851v1
- Date: Sat, 14 Feb 2026 19:18:34 GMT
- Title: Evaluating LLM-Generated ACSL Annotations for Formal Verification
- Authors: Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan
- Abstract summary: This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.
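For concreteness, below is a minimal sketch of the kind of ACSL-annotated C function the compared systems are asked to produce; the function, contract, and overflow guard are illustrative and not drawn from the paper's 506-program dataset.

```c
#include <limits.h>

/* The precondition rules out signed overflow (the kind of guard
   Frama-C's RTE plugin emits as an assertion); the postcondition
   states functional correctness. ACSL arithmetic is mathematical,
   so a + b in the contract cannot itself overflow. */
/*@ requires INT_MIN <= a + b <= INT_MAX;
    assigns \nothing;
    ensures \result == a + b;
*/
int add(int a, int b) {
    return a + b;
}
```

A file like this would be checked along the lines of `frama-c -wp file.c`, with the WP plugin discharging the resulting proof obligations through SMT solvers such as Alt-Ergo or Z3; the paper's exact command lines and solver set are not quoted here.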
Related papers
- Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles [3.4742046772246837]
In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs.
arXiv Detail & Related papers (2026-01-19T08:56:43Z)
- FASTRIC: Prompt Specification Language for Verifiable LLM Interactions [3.8073142980732997]
Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts.
arXiv Detail & Related papers (2025-12-22T01:19:50Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
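To illustrate what an autoformalizer outputs, here is a hedged sketch in Lean 4 (the target formal language and the example statement are illustrative; the abstract specifies neither): the natural-language claim "the sum of two even integers is even" rendered as a formal statement, with evenness spelled out to avoid library dependencies.

```lean
-- Natural-language problem: "The sum of two even integers is even."
-- One possible formalization; a valid output must be syntactically
-- well-formed and semantically faithful to the original problem.
theorem even_add_even (a b : Int)
    (ha : ∃ x, a = 2 * x) (hb : ∃ y, b = 2 * y) :
    ∃ z, a + b = 2 * z := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by omega⟩
```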
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- Supporting Software Formal Verification with Large Language Models: An Experimental Study [9.688989142858954]
SpecVerify integrates large language models with formal verification tools. This framework combines Claude 3.5 Sonnet with the ESBMC verifier to form an automated workflow. Evaluated on nine cyber-physical systems from Lockheed Martin, SpecVerify achieves 46.5% verification accuracy, comparable to NASA's CoCoSim.
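As a flavor of what the ESBMC side checks, a minimal assertion-bearing C program is sketched below; the program and property are invented for illustration and are unrelated to the Lockheed Martin benchmarks.

```c
#include <assert.h>

/* Saturating increment: the counter never exceeds its cap. */
int saturating_inc(int counter, int cap) {
    if (counter < cap) {
        counter = counter + 1;
    }
    return counter;
}

int main(void) {
    int c = 0;
    for (int i = 0; i < 10; i++) {
        c = saturating_inc(c, 5);
    }
    /* A bounded model checker unrolls the loop and either proves the
       assertion within the bound or reports a counterexample trace. */
    assert(c <= 5);
    return 0;
}
```

Something along the lines of `esbmc file.c` would check the assertion; SpecVerify's actual flags and property set are not given in the abstract.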
arXiv Detail & Related papers (2025-07-07T10:30:05Z)
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs). By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
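To make the program-generation idea concrete, here is a hedged example of the simple end of such a spectrum (the concrete program shapes GBCV uses are not detailed in this snippet): one branch over one variable, where a sound test set must probe the boundary.

```c
#include <stdio.h>

/* One branch, one variable: tests should cover x just below, at,
   and just above the boundary value 10. */
int classify(int x) {
    if (x <= 10) {
        return 0;  /* the boundary case x == 10 lands here */
    }
    return 1;
}

int main(void) {
    printf("%d %d %d\n", classify(9), classify(10), classify(11)); /* 0 0 1 */
    return 0;
}
```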
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- AutoPLC: Generating Vendor-Aware Structured Text for Programmable Logic Controllers [9.209415852653386]
AutoPLC is a framework capable of automatically generating vendor-aware ST code from natural language requirements. It is implemented for Siemens TIA Portal and the CODESYS platform. AutoPLC achieves 90%+ compilation success on our 914-task benchmark.
arXiv Detail & Related papers (2024-12-03T12:05:56Z)
- Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast [5.019791860882564]
This paper explores the effectiveness of OpenAI's GPT-4o model in generating specifications for C programs that are verifiable with VeriFast. Our results indicate that the specifications generated by GPT-4o preserve functional behavior but are often not verifiable.
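For orientation, a toy VeriFast-style contract is sketched below (illustrative only: VeriFast reads separation-logic contracts from `//@` comments placed between a function's header and body, and this function is not from the paper's benchmark).

```c
#include <limits.h>

/* `result` denotes the return value in the postcondition; the
   precondition excludes the signed overflow VeriFast checks for. */
int add(int x, int y)
    //@ requires INT_MIN <= x + y &*& x + y <= INT_MAX;
    //@ ensures result == x + y;
{
    return x + y;
}
```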
arXiv Detail & Related papers (2024-11-04T17:44:11Z)
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [54.14602121129874]
We introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data.
AutoIF transforms the validation of instruction-following data quality into code verification.
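To make "validation as code verification" concrete, here is a hedged sketch of a checker for one invented instruction; the snippet does not say which language or instructions AutoIF actually uses, so both are assumptions.

```c
#include <ctype.h>
#include <stdio.h>

/* Checker for the instruction "answer entirely in lowercase":
   a response passes iff it contains no uppercase letters. */
int follows_instruction(const char *response) {
    for (const char *p = response; *p != '\0'; p++) {
        if (isupper((unsigned char)*p)) {
            return 0;
        }
    }
    return 1;
}

int main(void) {
    printf("%d\n", follows_instruction("all lowercase here")); /* 1 */
    printf("%d\n", follows_instruction("Mixed Case"));         /* 0 */
    return 0;
}
```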
arXiv Detail & Related papers (2024-06-19T13:29:53Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems. It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims. CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language model (LLM)-generated responses.
We construct an open-domain, document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.