Related papers: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

URL: http://arxiv.org/abs/2405.01787v3
Date: Wed, 4 Sep 2024 19:32:16 GMT
Title: Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming
Authors: Saikat Chakraborty, Gabriel Ebner, Siddharth Bhat, Sarah Fakhoury, Sakina Fatima, Shuvendu Lahiri, Nikhil Swamy,
Abstract summary: We curate a dataset of 600K lines of open-source F* programs and proofs. This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox. We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
Score: 8.34623776815378
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54k top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.

Related papers

Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity [0.5370906227996627]
We present the first on synthetic data augmentation for project level proof oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperforms GPT-4o in project-level proof-oriented programming.
arXiv Detail & Related papers (2025-02-17T15:24:11Z)
Proving the Coding Interview: A Benchmark for Formally Verified Code Generation [3.5319285228327417]
We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness. We challenge the machine learning and program synthesis communities to solve both each general purpose programming problem and its associated correctness specifications.
arXiv Detail & Related papers (2025-02-08T22:54:58Z)
Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency. Our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification [11.115201117143929]
We present Rango, a fully automated synthesis proof tool for Coq. Rango identifies relevant premises and also similar proofs from the current project and uses them during synthesis. Our evaluation shows that Rango adding relevant to its context leads to a 47% increase in the number of theorems proven.
arXiv Detail & Related papers (2024-12-18T17:08:42Z)
dafny-annotator: AI-Assisted Verification of Dafny Programs [4.651620941143133]
We explore using a combination of Large Language Models and search to build dafny-annotator. On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods. Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples.
arXiv Detail & Related papers (2024-11-05T19:27:56Z)
Learning to Reason via Program Generation, Emulation, and Search [33.11955431589091]
Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities. Not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. We propose Code Generation and Emulated EXecution (CoGEX) to extend an LM's program synthesis skills to such tasks.
arXiv Detail & Related papers (2024-05-25T19:40:50Z)
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs. Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z)
Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models [3.4887856546295333]
This study provides a comparative analysis of state-of-the-art large language models (LLMs) We analyze how likely they generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt.
arXiv Detail & Related papers (2024-04-29T01:24:14Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
Towards AI-Assisted Synthesis of Verified Dafny Methods [1.0187122752343796]
Existing large language models show a severe lack of proficiency in verified programming. We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
arXiv Detail & Related papers (2024-02-01T00:07:23Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks. We first leverage the in-context learning ability of large language models to generate reasoning programs. We execute the program by delegating each sub-task to the corresponding sub-task handler.
arXiv Detail & Related papers (2023-05-22T06:11:15Z)
NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences. The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation. In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)
Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it. Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
Self-Supervised Learning to Prove Equivalence Between Straight-Line Programs via Rewrite Rules [9.1570563482476]
Two programs are equivalent if there exists a sequence of application of rewrite rules that leads to rewriting one program into the other. We propose a neural network architecture based on a transformer model to generate proofs of equivalence between program pairs. Our system, S4Eq, achieves 97% proof success on a curated dataset of 10,000 pairs of equivalent programs.
arXiv Detail & Related papers (2021-09-22T01:37:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.