Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
- URL: http://arxiv.org/abs/2510.12702v1
- Date: Tue, 14 Oct 2025 16:37:39 GMT
- Title: Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
- Authors: Cedric Richter, Heike Wehrheim,
- Abstract summary: Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints.<n>We introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts.<n>We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation nl2postcond.
- Score: 1.9551668880584971
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic software verifiers have become increasingly effective at the task of checking software against (formal) specifications. Yet, their adoption in practice has been hampered by the lack of such specifications in real world code. Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints embedded in code such as function names, comments or documentation. Using the generated postconditions as specifications in a subsequent verification, however, often leads verifiers to suggest invalid inputs, hinting at potential issues that ultimately turn out to be false alarms. To address this, we revisit the problem of specification inference from natural language in the context of automatic software verification. In the process, we introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts, consisting of postconditions as well as preconditions. We introduce metrics to validate and compare different NL2Contract approaches, using soundness, bug discriminative power of the generated contracts and their usability in the context of automatic software verification as key metrics. We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation nl2postcond. Our evaluation shows that (1) LLMs are generally effective at generating functional contracts sound for all possible inputs, (2) the generated contracts are sufficiently expressive for discriminating buggy from correct behavior, and (3) verifiers supplied with LLM inferred functional contracts produce fewer false alarms than when provided with postconditions alone. Further investigations show that LLM inferred preconditions generally align well with developers intentions which allows us to use automatic software verifiers to catch real-world bugs.
Related papers
- Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement [8.059802912761919]
We uncover a systematic failure of large language models (LLMs) in matching code to natural language requirements.<n>More detailed prompt design, particularly with those requiring explanations and proposed corrections, leads to higher misjudgment rates.<n>We propose a Fix-guided Verification Filter that treats the model proposed fix as executable counterfactual evidence.
arXiv Detail & Related papers (2026-02-28T08:35:25Z) - Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles [3.4742046772246837]
In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations.<n>A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs.
arXiv Detail & Related papers (2026-01-19T08:56:43Z) - Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation [11.445615378917578]
PACT is a program assessment and contract-adherence evaluation framework.<n>It provides a comprehensive test-suite corpus focused on contract violations.<n>It enables a systematic analysis of code generation under varied prompting conditions.
arXiv Detail & Related papers (2025-10-14T01:12:37Z) - Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications [0.6813925418351435]
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks.<n>In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements.<n>Our results reveal that LLMs frequently misclassify correct code implementations as either not satisfying requirements'' or containing potential defects.
arXiv Detail & Related papers (2025-08-17T13:07:26Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [78.1575956773948]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.<n>A promising yet largely uncharted alternative is formal language-based reasoning.<n>Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [79.74676890436174]
We present an APR tool for Dafny that uses formal specifications as oracles for fault localization and repair.<n>We localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program.<n>Our tool achieves 89.6% fault localization coverage and GPT-4o mini yields the highest repair success rate of 74.18%.
arXiv Detail & Related papers (2025-07-04T15:36:12Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.<n>For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code.<n>Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Towards Large Language Model Aided Program Refinement [10.089955747110444]
Program refinement involves correctness-preserving transformations from formal high-level specification statements into executable programs.
Large language models (LLMs) enable automatic code generations from informal natural language specifications.
We propose LLM4PR, a tool that combines formal program refinement techniques with informal LLM-based methods.
arXiv Detail & Related papers (2024-06-26T04:29:27Z) - Which Syntactic Capabilities Are Statistically Learned by Masked
Language Models for Code? [51.29970742152668]
We highlight relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval in Syntactic Capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters.
We propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z) - Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions? [17.03841665553565]
Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to specifications that match programmer intent.
In this paper we describe nl2postcond, the problem of leveraging LLMs for informal natural language formal method postconditions, expressed as program assertions.
arXiv Detail & Related papers (2023-10-03T06:55:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.