Combining Tests and Proofs for Better Software Verification
- URL: http://arxiv.org/abs/2601.16239v1
- Date: Wed, 21 Jan 2026 22:39:17 GMT
- Title: Combining Tests and Proofs for Better Software Verification
- Authors: Li Huang, Bertrand Meyer, Manuel Oriol
- Abstract summary: A different perspective is emerging, in which testing and proving are complementary rather than competing techniques for producing software of verified quality. A counterexample is an input combination that makes the program fail. One can, however, apply counterexample generation to incorrect programs, as a tool for automatic test generation. We can also use these mechanisms to help produce program fixes for incorrect programs, with a guarantee that the fixes are correct.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test or prove? These two approaches to software verification have long been presented as opposites. One is dynamic, the other static: a test executes the program, a proof only analyzes the program text. A different perspective is emerging, in which testing and proving are complementary rather than competing techniques for producing software of verified quality. Work performed over the past few years and reviewed here develops this complementarity by taking advantage of Design by Contract, as available in Eiffel, and exploiting a feature of modern program-proving tools based on ``Satisfiability Modulo Theories'' (SMT): counterexample generation. A counterexample is an input combination that makes the program fail. If we are trying to prove a program correct, we hope not to find any. One can, however, apply counterexample generation to incorrect programs, as a tool for automatic test generation. We can also introduce faults into a correct program and turn the counterexamples into an automatically generated regression test suite with full coverage. Additionally, we can use these mechanisms to help produce program fixes for incorrect programs, with a guarantee that the fixes are correct. All three applications, leveraging on the mechanisms of Eiffel and Design by Contract, hold significant promise to address some of the challenges of program testing, software maintenance and Automatic Program Repair.
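The counterexample-to-test pipeline the abstract describes can be illustrated with a minimal, self-contained sketch. This is not the Eiffel/AutoProof toolchain: a brute-force search over a small integer domain stands in for the SMT solver, and the `clamp` function, its seeded bug, and its contract are invented for illustration.

```python
# Illustrative sketch: in real proof tools an SMT solver produces the
# counterexample; here a brute-force search over a small integer domain
# plays that role.

def clamp(x, lo, hi):
    # require (precondition, Design-by-Contract style): lo <= hi
    assert lo <= hi
    if x < hi:   # seeded bug: should compare x against lo
        return lo
    if x > hi:
        return hi
    return x

def postcondition(x, lo, hi, result):
    # ensure clause: result stays in [lo, hi] and equals x when x is in range
    return lo <= result <= hi and (result == x if lo <= x <= hi else True)

def find_counterexample(domain):
    # The SMT solver's role: find inputs that satisfy the precondition
    # but violate the postcondition.
    for x in domain:
        for lo in domain:
            for hi in domain:
                if lo <= hi and not postcondition(x, lo, hi, clamp(x, lo, hi)):
                    return (x, lo, hi)
    return None

cex = find_counterexample(range(-3, 4))
print("counterexample:", cex)

# The counterexample becomes an automatically generated regression test:
def regression_test():
    x, lo, hi = cex
    assert postcondition(x, lo, hi, clamp(x, lo, hi))

try:
    regression_test()
    outcome = "pass"
except AssertionError:
    outcome = "fail"
print("regression test on the buggy clamp:", outcome)
```

Once the bug is repaired (compare `x` against `lo` in the first branch), the generated test passes and remains in the suite as regression protection, mirroring the abstract's use of counterexamples for automatic test generation.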
Related papers
- Inferring multiple helper Dafny assertions with LLMs [47.33158055894705]
We investigate the use of Large Language Models to automatically infer missing helper assertions in Dafny programs. We introduce a taxonomy of assertion types to analyze inference difficulty. Results show that automated assertion inference can substantially reduce proof engineering effort.
arXiv Detail & Related papers (2025-10-31T09:45:39Z) - Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z) - Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [79.74676890436174]
We present an APR tool for Dafny that uses formal specifications as oracles for fault localization and repair. We localize faults through a series of steps, which include using Hoare logic to determine the state of each statement within the program. Our tool achieves 89.6% fault localization coverage, and GPT-4o mini yields the highest repair success rate of 74.18%.
arXiv Detail & Related papers (2025-07-04T15:36:12Z) - Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces [3.7752830020595796]
We curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases. The dataset comprises 288,617 hacks for 5,578 programming problems. The source code for 2,196 submitted solutions to these problems can be broken with their corresponding hacks.
arXiv Detail & Related papers (2025-03-30T14:50:03Z) - A Unit Proofing Framework for Code-level Verification: A Research Agenda [2.7325338323814328]
We propose a research agenda for a unit proofing framework, covering both methods and tools. This will enable engineers to discover code-level defects early.
arXiv Detail & Related papers (2024-10-18T18:37:36Z) - Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming [8.34623776815378]
We curate a dataset of 600K lines of open-source F* programs and proofs.
This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox.
We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
arXiv Detail & Related papers (2024-05-03T00:14:33Z) - Execution-free Program Repair [31.162584619997393]
The novel idea in the Proof2Fix methodology and tool presented here is to rely instead on a program prover, without the need to run tests or to run the program at all.
Results show that Proof2Fix finds and fixes significant historical bugs.
arXiv Detail & Related papers (2024-05-02T14:18:40Z) - Retromorphic Testing: A New Approach to the Test Oracle Problem [4.183247207161993]
A test oracle serves as a criterion or mechanism to assess the correspondence between software output and the anticipated behavior for a given input set.
Inspired by the mathematical concept of inverse function, we present Retromorphic Testing, a novel black-box testing methodology.
This paper presents its three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications.
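The inverse-function idea behind Retromorphic Testing can be sketched in a few lines (the functions and inputs below are illustrative, not taken from the paper): instead of specifying the expected output of a function, check that composing it with its inverse recovers the original input, so the round trip itself serves as the oracle.

```python
import base64

# Round-trip oracle (illustrative sketch): no expected outputs are
# needed; decode(encode(x)) == x is the correctness criterion.

def encode(data: bytes) -> bytes:
    return base64.b64encode(data)

def decode(data: bytes) -> bytes:
    return base64.b64decode(data)

def round_trip_oracle(inputs):
    # Collect every input whose round trip fails to reproduce it.
    return [x for x in inputs if decode(encode(x)) != x]

failures = round_trip_oracle([b"", b"hello", b"\x00\xff\x10"])
print("failures:", failures)
```

An empty failure list means the oracle is satisfied on these inputs; any entry would be a concrete failing test case, obtained without ever stating the expected encoded form.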
arXiv Detail & Related papers (2023-10-10T09:03:01Z) - NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Learning from Self-Sampled Correct and Partially-Correct Programs [96.66452896657991]
We propose to let the model perform sampling during training and learn from both self-sampled fully-correct programs and partially-correct programs.
We show that our use of self-sampled correct and partially-correct programs can benefit learning and help guide the sampling process.
Our proposed method improves the pass@k performance by 3.1% to 12.3% compared to learning from a single reference program with MLE.
arXiv Detail & Related papers (2022-05-28T03:31:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.