Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
- URL: http://arxiv.org/abs/2405.01787v3
- Date: Wed, 4 Sep 2024 19:32:16 GMT
- Title: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
- Authors: Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy,
- Abstract summary: We curate a dataset of 600K lines of open-source F* programs and proofs.
This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox.
We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
- Score: 8.34623776815378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54k top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.
Related papers
- Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity [0.5370906227996627]
We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair.
Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language.
We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming.
arXiv Detail & Related papers (2025-02-17T15:24:11Z) - Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems.
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - Proving the Coding Interview: A Benchmark for Formally Verified Code Generation [3.5319285228327417]
We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness.
We challenge the machine learning and program synthesis communities to solve both each general purpose programming problem and its associated correctness specifications.
arXiv Detail & Related papers (2025-02-08T22:54:58Z) - Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency.
Our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification [11.115201117143929]
We present Rango, a fully automated synthesis proof tool for Coq.
Rango identifies relevant premises and also similar proofs from the current project and uses them during synthesis.
Our evaluation shows that Rango adding relevant to its context leads to a 47% increase in the number of theorems proven.
arXiv Detail & Related papers (2024-12-18T17:08:42Z) - dafny-annotator: AI-Assisted Verification of Dafny Programs [4.651620941143133]
We explore using a combination of Large Language Models and search to build dafny-annotator.
On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods.
Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples.
arXiv Detail & Related papers (2024-11-05T19:27:56Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks.
We first leverage the in-context learning ability of large language models to generate reasoning programs.
We execute the program by delegating each sub-task to the corresponding sub-task handler.
arXiv Detail & Related papers (2023-05-22T06:11:15Z) - NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual
Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z) - Self-Supervised Learning to Prove Equivalence Between Straight-Line
Programs via Rewrite Rules [9.1570563482476]
Two programs are equivalent if there exists a sequence of application of rewrite rules that leads to rewriting one program into the other.
We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs.
Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programs.
arXiv Detail & Related papers (2021-09-22T01:37:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.