Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles
- URL: http://arxiv.org/abs/2601.12845v1
- Date: Mon, 19 Jan 2026 08:56:43 GMT
- Title: Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles
- Authors: João Pascoal Faria, Emanuel Trigo, Vinicius Honorato, Rui Abreu
- Abstract summary: In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent verification tools aim to make formal verification more accessible to software engineers by automating most of the verification process. However, annotating conventional programs with the formal specification and verification constructs (preconditions, postconditions, loop invariants, auxiliary predicates and functions, and proof helpers) required to prove their correctness still demands significant manual effort and expertise. This paper investigates how LLMs can automatically generate such annotations for programs written in Dafny, a verification-aware programming language, starting from conventional code accompanied by natural language specifications (in comments) and test code. In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations, using verifier feedback. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs. Assertions in the test cases served as static oracles to automatically validate the generated pre/postconditions. We also compare generated and manual solutions and present an extension for Visual Studio Code that incorporates automatic generation into the IDE, with encouraging usability feedback.
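To make the annotation constructs concrete, the following is a minimal Dafny sketch (illustrative only, not taken from the paper's 110-program benchmark) showing the kinds of clauses the abstract refers to: a precondition (`requires`), a postcondition (`ensures`), loop invariants, and an auxiliary function used as part of the specification.

```dafny
// Illustrative example: sum the first n elements of a sequence.

// Auxiliary function serving as the formal specification of the sum.
function SumTo(a: seq<int>, n: nat): int
  requires n <= |a|                  // precondition
{
  if n == 0 then 0 else SumTo(a, n - 1) + a[n - 1]
}

method Sum(a: seq<int>) returns (s: int)
  ensures s == SumTo(a, |a|)         // postcondition
{
  s := 0;
  var i := 0;
  while i < |a|
    invariant 0 <= i <= |a|          // loop invariant
    invariant s == SumTo(a, i)       // partial sum tracked so far
  {
    s := s + a[i];
    i := i + 1;
  }
}
```

In the workflow the abstract describes, an LLM would propose the `requires`/`ensures`/`invariant` clauses for an initially unannotated method, the Dafny verifier's error messages would drive the repair iterations, and assertions in accompanying test code (e.g. calling `Sum` on a literal sequence and asserting the expected result) would serve as static oracles checking that the generated postcondition captures the intended behavior.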
Related papers
- Evaluating LLM-Generated ACSL Annotations for Formal Verification [0.0]
This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers.
arXiv Detail & Related papers (2026-02-14T19:18:34Z) - Guidelines to Prompt Large Language Models for Code Generation: An Empirical Characterization [82.29178197694819]
We derive and evaluate development-specific prompt optimization guidelines. We use an iterative, test-driven approach to automatically refine code generation prompts. We conduct an assessment with 50 practitioners, who report their usage of the elicited prompt improvement patterns.
arXiv Detail & Related papers (2026-01-19T15:01:42Z) - Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification? [1.9551668880584971]
Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints. We introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts. We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation (nl2postcond).
arXiv Detail & Related papers (2025-10-14T16:37:39Z) - Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [79.74676890436174]
We present an APR tool for Dafny that uses formal specifications as oracles for fault localization and repair. We localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program. Our tool achieves 89.6% fault localization coverage, and GPT-4o mini yields the highest repair success rate of 74.18%.
arXiv Detail & Related papers (2025-07-04T15:36:12Z) - Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [54.14602121129874]
We introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data.
AutoIF transforms the validation of instruction-following data quality into code verification.
arXiv Detail & Related papers (2024-06-19T13:29:53Z) - Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification [15.686651364655958]
AutoSpec is an automated approach to synthesize specifications for automated program verification.
It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof.
It can be successfully applied to verify the programs in a real-world X509-parser project.
arXiv Detail & Related papers (2024-03-31T18:15:49Z) - Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z) - Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters.
We propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z) - Automatic Program Instrumentation for Automatic Verification (Extended Technical Report) [0.0]
In deductive verification and software model checking, dealing with certain specification language constructs can be problematic.
We propose instrumentation as a unifying verification paradigm that subsumes various existing ad-hoc approaches.
We implement our approach in the MonoCera tool, which is tailored to the verification of programs with aggregation.
arXiv Detail & Related papers (2023-05-26T14:55:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.