Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
- URL: http://arxiv.org/abs/2502.11901v1
- Date: Mon, 17 Feb 2025 15:24:11 GMT
- Title: Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
- Authors: Dylan Zhang, Justin Wang, Tianran Sun
- Abstract summary: We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair.
Our method addresses data scarcity by synthesizing basic proof-oriented programming problems to build proficiency in the language.
We show that our fine-tuned 14B-parameter model, PoPilot, exceeds GPT-4o's performance on project-level proof-oriented programming by a 64% relative margin.
- Score: 0.5370906227996627
- License:
- Abstract: Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process of proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair. Our method addresses data scarcity by (1) synthesizing basic proof-oriented programming problems to build proficiency in the language, (2) incorporating diverse coding data to elicit reasoning capability, and (3) creating new proof and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B-parameter model, PoPilot, exceeds GPT-4o's performance on project-level proof-oriented programming by a 64% relative margin, and improves on GPT-4o's self-repair by 54% when used to repair GPT-4o's outputs.
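The abstract describes a generate-then-repair workflow: a model proposes an F* program with proofs, a checker validates it, and failed attempts are fed back to the model together with the diagnostics for repair. Below is a minimal sketch of such a loop in Python. It is an illustration under stated assumptions, not the paper's actual pipeline: the `query_model` call, the prompt wording, and the `fstar.exe` invocation and flags are placeholders to be adapted to a concrete setup.

```python
import os
import subprocess
import tempfile


def check_fstar(source: str) -> tuple[bool, str]:
    """Run the F* checker on a candidate program and return (ok, diagnostics).
    The binary name and flags below are assumptions; adjust to your install."""
    with tempfile.NamedTemporaryFile("w", suffix=".fst", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            ["fstar.exe", path],  # hypothetical invocation of the F* checker
            capture_output=True, text=True, timeout=120,
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    finally:
        os.remove(path)


def generate_then_repair(task: str, query_model, max_repairs: int = 3):
    """Generate an F* solution, then iteratively repair it using checker feedback.
    `query_model(prompt) -> str` is a placeholder for any LLM call."""
    candidate = query_model(f"Write an F* program with proofs for:\n{task}")
    for _ in range(max_repairs):
        ok, diagnostics = check_fstar(candidate)
        if ok:
            return candidate
        # Feed the verifier's errors back to the model and ask for a fix.
        candidate = query_model(
            "The following F* program fails to verify.\n"
            f"Errors:\n{diagnostics}\n\nProgram:\n{candidate}\n\n"
            "Return a corrected program."
        )
    return None  # give up once the repair budget is exhausted
```

In the paper's setting, the fine-tuned 14B model (PoPilot) would play the role of `query_model`; the same loop can also be pointed at another model's failed outputs, which is how the reported repair of GPT-4o's generations can be read.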
Related papers
- Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems.
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - Phi-4 Technical Report [72.06109095293243]
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality.
Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process.
arXiv Detail & Related papers (2024-12-12T03:37:41Z) - Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a novel framework that overcomes the lack of human-written proof to enable automated proof generation of Rust code.
We demonstrate superior efficiency and precision compared to GPT-4o.
This advancement leads to a significant improvement in performance, achieving a 70.50% accuracy rate in a benchmark crafted by human experts.
arXiv Detail & Related papers (2024-10-21T08:15:45Z) - Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.
The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.
We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming [8.34623776815378]
We curate a dataset of 600K lines of open-source F* programs and proofs.
This dataset includes software used in production systems ranging from Windows and Linux to Python and Firefox.
We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
arXiv Detail & Related papers (2024-05-03T00:14:33Z) - Towards AI-Assisted Synthesis of Verified Dafny Methods [1.0187122752343796]
Existing large language models show a severe lack of proficiency in verified programming.
We show how to improve two pretrained models' proficiency in the Dafny verification-aware language.
arXiv Detail & Related papers (2024-02-01T00:07:23Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)