Evaluating LLM-Generated ACSL Annotations for Formal Verification
- URL: http://arxiv.org/abs/2602.13851v1
- Date: Sat, 14 Feb 2026 19:18:34 GMT
- Title: Evaluating LLM-Generated ACSL Annotations for Formal Verification
- Authors: Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan
- Abstract summary: This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.
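For concreteness, below is a minimal sketch of the kind of ACSL-annotated C function the compared systems are asked to produce; the function, contract, and overflow guard are illustrative and not drawn from the paper's 506-program dataset.

```c
#include <limits.h>

/* The precondition rules out signed overflow (the kind of guard
   Frama-C's RTE plugin emits as an assertion); the postcondition
   states functional correctness. ACSL arithmetic is mathematical,
   so a + b in the contract cannot itself overflow. */
/*@ requires INT_MIN <= a + b <= INT_MAX;
    assigns \nothing;
    ensures \result == a + b;
*/
int add(int a, int b) {
    return a + b;
}
```

A file like this would be checked along the lines of `frama-c -wp file.c`, with the WP plugin discharging the resulting proof obligations through SMT solvers such as Alt-Ergo or Z3; the paper's exact command lines and solver set are not quoted here.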
Related papers
- Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles [3.4742046772246837]
In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs.
arXiv Detail & Related papers (2026-01-19T08:56:43Z)
- FASTRIC: Prompt Specification Language for Verifiable LLM Interactions [3.8073142980732997]
Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts.
arXiv Detail & Related papers (2025-12-22T01:19:50Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
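To illustrate what an autoformalizer outputs, here is a hedged sketch in Lean 4 (the target formal language and the example statement are illustrative; the abstract specifies neither): the natural-language claim "the sum of two even integers is even" rendered as a formal statement, with evenness spelled out to avoid library dependencies.

```lean
-- Natural-language problem: "The sum of two even integers is even."
-- One possible formalization; a valid output must be syntactically
-- well-formed and semantically faithful to the original problem.
theorem even_add_even (a b : Int)
    (ha : ∃ x, a = 2 * x) (hb : ∃ y, b = 2 * y) :
    ∃ z, a + b = 2 * z := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by omega⟩
```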
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- Supporting Software Formal Verification with Large Language Models: An Experimental Study [9.688989142858954]
SpecVerify integrates large language models with formal verification tools. This framework combines Claude 3.5 Sonnet with the ESBMC verifier to form an automated workflow. Evaluated on nine cyber-physical systems from Lockheed Martin, SpecVerify achieves 46.5% verification accuracy, comparable to NASA's CoCoSim.
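As a flavor of what the ESBMC side checks, a minimal assertion-bearing C program is sketched below; the program and property are invented for illustration and are unrelated to the Lockheed Martin benchmarks.

```c
#include <assert.h>

/* Saturating increment: the counter never exceeds its cap. */
int saturating_inc(int counter, int cap) {
    if (counter < cap) {
        counter = counter + 1;
    }
    return counter;
}

int main(void) {
    int c = 0;
    for (int i = 0; i < 10; i++) {
        c = saturating_inc(c, 5);
    }
    /* A bounded model checker unrolls the loop and either proves the
       assertion within the bound or reports a counterexample trace. */
    assert(c <= 5);
    return 0;
}
```

Something along the lines of `esbmc file.c` would check the assertion; SpecVerify's actual flags and property set are not given in the abstract.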
arXiv Detail & Related papers (2025-07-07T10:30:05Z)
- A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability [0.8287206589886879]
We propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach to evaluate large language models (LLMs). By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations.
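To make the program-generation idea concrete, here is a hedged example of the simple end of such a spectrum (the concrete program shapes GBCV uses are not detailed in this snippet): one branch over one variable, where a sound test set must probe the boundary.

```c
#include <stdio.h>

/* One branch, one variable: tests should cover x just below, at,
   and just above the boundary value 10. */
int classify(int x) {
    if (x <= 10) {
        return 0;  /* the boundary case x == 10 lands here */
    }
    return 1;
}

int main(void) {
    printf("%d %d %d\n", classify(9), classify(10), classify(11)); /* 0 0 1 */
    return 0;
}
```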
arXiv Detail & Related papers (2025-02-05T03:51:44Z)
- AutoPLC: Generating Vendor-Aware Structured Text for Programmable Logic Controllers [9.209415852653386]
AutoPLC is a framework capable of automatically generating vendor-aware ST code from natural language requirements. It is implemented for Siemens TIA Portal and the CODESYS platform. AutoPLC achieves 90%+ compilation success on our 914-task benchmark.
arXiv Detail & Related papers (2024-12-03T12:05:56Z)
- Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast [5.019791860882564]
This paper explores the effectiveness of OpenAI's GPT-4o model in generating specifications for C programs that are verifiable with VeriFast. Our results indicate that the specifications generated by GPT-4o preserve functional behavior but are often not verifiable.
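For orientation, a toy VeriFast-style contract is sketched below (illustrative only: VeriFast reads separation-logic contracts from `//@` comments placed between a function's header and body, and this function is not from the paper's benchmark).

```c
#include <limits.h>

/* `result` denotes the return value in the postcondition; the
   precondition excludes the signed overflow VeriFast checks for. */
int add(int x, int y)
    //@ requires INT_MIN <= x + y &*& x + y <= INT_MAX;
    //@ ensures result == x + y;
{
    return x + y;
}
```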
arXiv Detail & Related papers (2024-11-04T17:44:11Z)
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [54.14602121129874]
We introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data.
AutoIF transforms the validation of instruction-following data quality into code verification.
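To make "validation as code verification" concrete, here is a hedged sketch of a checker for one invented instruction; the snippet does not say which language or instructions AutoIF actually uses, so both are assumptions.

```c
#include <ctype.h>
#include <stdio.h>

/* Checker for the instruction "answer entirely in lowercase":
   a response passes iff it contains no uppercase letters. */
int follows_instruction(const char *response) {
    for (const char *p = response; *p != '\0'; p++) {
        if (isupper((unsigned char)*p)) {
            return 0;
        }
    }
    return 1;
}

int main(void) {
    printf("%d\n", follows_instruction("all lowercase here")); /* 1 */
    printf("%d\n", follows_instruction("Mixed Case"));         /* 0 */
    return 0;
}
```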
arXiv Detail & Related papers (2024-06-19T13:29:53Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems. It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims. CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language model (LLM)-generated responses.
We construct an open-domain, document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.