Related papers: Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs

URL: http://arxiv.org/abs/2507.03659v3
Date: Mon, 08 Sep 2025 19:11:06 GMT
Title: Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs
Authors: Valentina Wu, Alexandra Mendes, Alexandre Abreu
Abstract summary: We present an APR tool for Dafny that uses formal specifications as oracles for fault localization and repair. We localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program. Our tool achieves 89.6% fault localization coverage and GPT-4o mini yields the highest repair success rate of 74.18%.
Score: 79.74676890436174
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Debugging and repairing faults when programs fail to formally verify can be complex and time-consuming. Automated Program Repair (APR) can ease this burden by automatically identifying and fixing faults. However, traditional APR techniques often rely on test suites for validation, which may not capture all possible scenarios. In contrast, formal specifications provide strong correctness criteria, enabling more effective automated repair. In this paper, we present an APR tool for Dafny, a verification-aware programming language, that uses formal specifications (including pre-conditions, post-conditions, and invariants) as oracles for fault localization and repair. Assuming the correctness of the specifications and focusing on arithmetic bugs, we localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program, and applying Large Language Models (LLMs) to synthesize candidate fixes. The models considered are GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs. Our tool achieves 89.6% fault localization coverage and GPT-4o mini yields the highest repair success rate of 74.18%. These results highlight the potential of combining formal reasoning with LLM-based program synthesis for automated program repair.
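The idea of using a specification as a repair oracle can be illustrated with a small sketch. This is not the authors' tool: it is written in Python rather than Dafny, the example program and all names (`run`, `satisfies_spec`, `repair`) are hypothetical, a bounded check over small inputs stands in for a real verifier, and a fixed list of operators stands in for LLM-generated candidate fixes.

```python
import operator

# Candidate arithmetic operators; a stand-in for LLM-proposed fixes (assumption).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run(op_name, n):
    """Hypothetical program: loop body is `total = total <op> i` for i in 1..n."""
    op = OPS[op_name]
    total = 0
    for i in range(1, n + 1):
        total = op(total, i)
    return total

def precondition(n):
    # requires n >= 0
    return n >= 0

def postcondition(n, result):
    # ensures result == n * (n + 1) / 2; the spec is assumed to be correct
    return result == n * (n + 1) // 2

def satisfies_spec(op_name, bound=20):
    """Bounded check of the spec for 0 <= n <= bound (not a formal proof)."""
    return all(
        postcondition(n, run(op_name, n))
        for n in range(bound + 1)
        if precondition(n)
    )

def repair(buggy_op):
    """Return the operators under which the program meets the specification."""
    if satisfies_spec(buggy_op):
        return [buggy_op]
    return [c for c in OPS if c != buggy_op and satisfies_spec(c)]

if __name__ == "__main__":
    # The buggy program subtracts where it should add.
    print(repair("-"))
```

The point of the sketch is only that candidate fixes are accepted or rejected by the specification rather than by a hand-written test suite.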

Related papers

Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles [3.4742046772246837]
In experiments on 110 Dafny programs, a multimodel approach combining Claude Opus 4.5 and GPT-5.2 generated correct annotations for 98.2% of the programs within at most 8 repair iterations. A logistic regression analysis shows that proof-helper annotations contribute disproportionately to problem difficulty for current LLMs.
arXiv Detail & Related papers (2026-01-19T08:56:43Z)
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors. Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z)
RelRepair: Enhancing Automated Program Repair by Retrieving Relevant Code [11.74568238259256]
RelRepair retrieves relevant project-specific code to enhance automated program repair. We evaluate RelRepair on two widely studied datasets, Defects4J V1.2 and ManySStuBs4J.
arXiv Detail & Related papers (2025-09-20T14:07:28Z)
Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization [0.0]
Automated Program Repair (APR) for introductory programming assignments (IPAs) is motivated by the large number of student enrollments. We propose a novel approach that combines the strengths of both FM-based fault localization and Large Language Models (LLMs). Our method uses MaxSAT-based fault localization to identify buggy parts of a program, then presents the LLM with a program sketch devoid of these buggy statements.
arXiv Detail & Related papers (2024-12-19T12:08:44Z)
Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
A Deep Dive into Large Language Models for Automated Bug Localization and Repair [12.756202755547024]
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR). In this study, we take a deep dive into automated bug fixing utilizing LLMs. This methodological separation of bug localization and fixing using different LLMs enables effective integration of diverse contextual information. Toggle achieves the new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
arXiv Detail & Related papers (2024-04-17T17:48:18Z)
An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications [7.286515881369693]
This paper systematically investigates the capacity of Large Language Models (LLMs) to repair declarative specifications in Alloy. We designed 12 different repair settings, encompassing single-agent and dual-agent paradigms, utilizing various LLMs. Our study reveals that the dual-agent setup with auto-prompting outperforms the other settings, albeit with a marginal increase in the number of iterations and token usage.
arXiv Detail & Related papers (2024-04-17T03:46:38Z)
Aligning the Objective of LLM-based Program Repair [14.935596175148586]
This paper investigates a new approach to adapt large language models (LLMs) to program repair. Our core insight is that LLMs' APR capability can be greatly improved by simply aligning the output to their training objective. Based on this insight, we designed D4C, a straightforward prompting framework for APR.
arXiv Detail & Related papers (2024-04-13T02:36:40Z)
A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back. Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair. This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses. We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document. Preliminary experiments show that FacTool, FactScore and Perplexity are struggling to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
Is Self-Repair a Silver Bullet for Code Generation? [68.02601393906083]
Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance. We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
arXiv Detail & Related papers (2023-06-16T15:13:17Z)
Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome". We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
arXiv Detail & Related papers (2023-04-20T16:12:05Z)
LM-Critic: Language Models for Unsupervised Grammatical Error Correction [128.9174409251852]
We show how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical. We apply this LM-Critic and BIFI along with a large set of unlabeled sentences to bootstrap realistic ungrammatical / grammatical pairs for training a corrector.
arXiv Detail & Related papers (2021-09-14T17:06:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.