Clover: Closed-Loop Verifiable Code Generation
- URL: http://arxiv.org/abs/2310.17807v4
- Date: Sat, 16 Nov 2024 21:57:49 GMT
- Title: Clover: Closed-Loop Verifiable Code Generation
- Authors: Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett
- Abstract summary: We introduce the Clover paradigm, which uses consistency checking to provide a strong filter for incorrect code.
Clover performs consistency checks among code, docstrings, and formal annotations.
Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.
- Score: 2.5369596232063896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset: (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for adversarial incorrect ones (no false positives). Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.
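Concretely, Clover's closed loop can be pictured as pairwise checks among the three artifacts: the formal verifier ties the code to its annotations, and an LLM ties the docstring to both. Below is a minimal sketch of that loop, assuming a Dafny 4.x CLI on PATH and a hypothetical `llm_judge` helper; Clover's actual checker regenerates each artifact from the others and compares the results rather than asking for a single yes/no judgment.

```python
import subprocess
import tempfile
from pathlib import Path

def dafny_verifies(annotated_program: str) -> bool:
    """True iff the Dafny verifier accepts the program (code vs. annotations).

    Assumes the Dafny 4.x CLI is on PATH ('dafny verify <file>').
    """
    with tempfile.NamedTemporaryFile("w", suffix=".dfy", delete=False) as f:
        f.write(annotated_program)
        path = Path(f.name)
    try:
        result = subprocess.run(["dafny", "verify", str(path)],
                                capture_output=True)
        return result.returncode == 0
    finally:
        path.unlink()

def consistency_check(annotated_program: str, docstring: str, llm_judge) -> bool:
    """Accept an instance only if every pairwise check passes.

    `llm_judge(a, b)` is a hypothetical LLM call returning True when the two
    artifacts describe the same behavior.
    """
    return (dafny_verifies(annotated_program)          # code vs. annotations
            and llm_judge(annotated_program, docstring))  # docstring vs. both
```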
Related papers
- IterPref: Focal Preference Learning for Code Generation via Iterative Debugging [28.020886216989872]
We propose IterPref, a new preference alignment framework for Code LLMs.
IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm.
IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench.
arXiv Detail & Related papers (2025-03-04T16:56:34Z)
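A plausible reading of IterPref's tailored DPO, sketched below: the preference loss is computed only over tokens inside the localized error region. The tensor names and masking scheme are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
                    ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
                    error_mask: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """DPO restricted to flagged error-region tokens.

    All logp_* tensors are per-token log-probs of shape (batch, seq);
    error_mask is 1.0 on tokens inside the localized error region, else 0.0.
    """
    # Sum policy/reference log-ratios over the error region only.
    chosen = ((logp_chosen - ref_logp_chosen) * error_mask).sum(-1)
    rejected = ((logp_rejected - ref_logp_rejected) * error_mask).sum(-1)
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```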
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
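The Sol-Ver self-play loop is easy to picture: one model writes both solutions and tests, execution filters the pairs, and the survivors become training data. A sketch under that reading; the `model.*` methods are hypothetical stand-ins.

```python
def run_tests(code: str, tests: str) -> bool:
    """Execute generated tests against generated code in a shared namespace.

    Tests are expected to raise AssertionError on failure. (Sandboxing is
    omitted here; a real system must isolate untrusted generated code.)
    """
    env: dict = {}
    try:
        exec(code, env)
        exec(tests, env)
        return True
    except Exception:
        return False

def self_play_round(model, problems: list[str]) -> None:
    """One solver-verifier round: keep only mutually consistent pairs."""
    survivors = []
    for p in problems:
        code = model.generate_code(p)    # hypothetical LLM call
        tests = model.generate_tests(p)  # hypothetical LLM call
        if run_tests(code, tests):
            survivors.append((p, code, tests))
    model.finetune(survivors)            # jointly improves both capacities
```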
- TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference [0.0]
Large language models (LLMs) have proven to be very capable, but access to the best models currently relies on inference providers, which introduces trust challenges.
We propose TOPLOC, a novel method for verifiable inference that addresses this problem.
arXiv Detail & Related papers (2025-01-27T12:46:45Z)
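The core TOPLOC idea, as we read it, is to commit to an inference trace by fingerprinting its activations so a third party can spot-check the provider. A toy sketch follows; the real encoding is engineered to tolerate benign numerical variation across GPUs, kernels, and batch sizes, which simple rounding only gestures at.

```python
import hashlib
import numpy as np

def activation_fingerprint(activations: np.ndarray, k: int = 128,
                           decimals: int = 2) -> str:
    """Hash the top-k activations (by magnitude) into a compact commitment.

    An illustration of the shape of the scheme, not the paper's construction.
    """
    flat = activations.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # top-k by magnitude
    idx.sort()                                     # canonical ordering
    vals = np.round(flat[idx], decimals)           # crude noise tolerance
    return hashlib.sha256(idx.tobytes() + vals.tobytes()).hexdigest()
```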
- AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement [25.80131224070207]
We aim to use formal verification to provide mathematical guarantees that the generated code is correct.
However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs.
We introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation.
arXiv Detail & Related papers (2024-12-09T03:22:35Z)
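Our reading of AlphaVerus's "Treefinement" is a verifier-guided tree search over candidate repairs, sketched below. `verify`, `propose_fixes`, and `score` are hypothetical stand-ins (in the paper the verifier is Verus and repair proposals come from the LLM).

```python
import heapq

def treefinement(program: str, verify, propose_fixes, score,
                 budget: int = 50) -> str | None:
    """Best-first search: expand the most promising candidate, branch on
    LLM-proposed fixes for the verifier's error messages, stop on success."""
    frontier = [(-score(program), program)]
    for _ in range(budget):
        if not frontier:
            break
        _, candidate = heapq.heappop(frontier)
        ok, errors = verify(candidate)        # e.g. run Verus, collect errors
        if ok:
            return candidate                  # formally verified: done
        for fix in propose_fixes(candidate, errors):
            heapq.heappush(frontier, (-score(fix), fix))
    return None
```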
- dafny-annotator: AI-Assisted Verification of Dafny Programs [4.651620941143133]
We explore using a combination of Large Language Models and search to build dafny-annotator.
On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods.
Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples.
arXiv Detail & Related papers (2024-11-05T19:27:56Z)
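The greedy search in dafny-annotator is small enough to sketch. `verifies` stands in for the Dafny toolchain and `propose` for the LLaMa-guided proposer; the paper's search also exploits verifier feedback when ranking candidates, which this simplification omits.

```python
def greedy_annotate(program: str, verifies, propose,
                    max_steps: int = 8) -> str | None:
    """Insert one LLM-proposed annotation (e.g. a loop invariant) per step,
    stopping as soon as the whole method verifies."""
    for _ in range(max_steps):
        if verifies(program):
            return program                    # fully verified: success
        for candidate in propose(program):    # variants with one new annotation
            if candidate != program:
                program = candidate           # greedy: commit to first proposal
                break
        else:
            return None                       # proposer produced nothing new
    return None                               # budget exhausted
```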
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
We introduce Positional Integrity Encoding (PIE).
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
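The reason cached positions can be fixed in place at all is that rotary position embeddings compose: shifting a cached key from position p to p' is a rotation by the delta, not a full recomputation. A small numpy demonstration of that identity (our illustration of the principle behind PIE, not the paper's implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Apply rotary position embedding at `pos` to one head vector."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    x1, x2 = x[0::2], x[1::2]
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

key = np.random.randn(64)
p, p_new = 10, 13                           # an edit inserted 3 tokens upstream
recomputed = rope(key, p_new)               # full recomputation
re_rotated = rope(rope(key, p), p_new - p)  # fix the cached key in place
assert np.allclose(recomputed, re_rotated)  # identical results
```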
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge [24.203554078434365]
We propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process.
Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large.
arXiv Detail & Related papers (2024-05-01T00:46:22Z)
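For context, the standard speculative-decoding accept/reject step is sketched below; this Clover's contribution is a regressive draft head that conditions each speculated token on the previous one, which the generic skeleton does not show.

```python
import numpy as np

def speculative_step(draft_probs: np.ndarray, target_probs: np.ndarray,
                     draft_tokens: list[int],
                     rng: np.random.Generator) -> list[int]:
    """Accept drafted tokens while the target model agrees; on the first
    rejection, resample from the residual distribution. (The usual bonus
    token after a fully accepted draft is omitted.)

    draft_probs/target_probs: (k, vocab) rows for the k drafted positions.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)                  # target agrees often enough: keep
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()       # renormalized residual max(0, p - q)
            out.append(int(rng.choice(len(residual), p=residual)))
            break
    return out
```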
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader model suffers from performance deterioration when trained on labeled data from one domain and used in another.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
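The summary does not spell out the adversarial algorithm; one standard way to make a retriever robust to distribution shift is a gradient reversal layer feeding a domain classifier, sketched below as an assumption on our part rather than the paper's method.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated (scaled) gradient on the
    backward pass: upstream features are pushed to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # Train a domain classifier on grad_reverse(retriever_features): the
    # classifier learns the domain while the retriever unlearns it.
    return GradReverse.apply(x, lam)
```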
- Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [13.135962181354465]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection.
The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing.
We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
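The detection signal behind TraWiC, as we read it: if a model reproduces held-out tokens of a file far more reliably than a corpus baseline, the file (or a near clone) was likely in its training set. A toy hit-rate sketch; the real pipeline builds separate syntactic and semantic features per file and feeds them to a trained classifier.

```python
def inclusion_score(model_predict, tokens: list[str], stride: int = 10) -> float:
    """Fraction of periodically held-out tokens the model reproduces exactly.

    `model_predict(prefix)` is a hypothetical single-token completion call.
    """
    hits = total = 0
    for i in range(stride, len(tokens), stride):
        hits += int(model_predict(tokens[:i]) == tokens[i])
        total += 1
    return hits / max(total, 1)
```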
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
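MUSTARD's synthesis reduces to a generate-and-validate loop: draft a theorem with a step-by-step proof from sampled seed concepts, and keep the pair only if a proof assistant certifies it. A sketch with hypothetical `llm_propose` and `prover_accepts` stand-ins (the paper validates with Lean):

```python
import random

def synthesize(concepts: list[str], llm_propose, prover_accepts,
               n_attempts: int = 1000) -> list[tuple[str, str]]:
    """Keep only machine-validated (theorem, proof) pairs."""
    dataset = []
    for _ in range(n_attempts):
        seeds = random.sample(concepts, k=2)   # mix concepts for diversity
        theorem, proof = llm_propose(seeds)    # hypothetical LLM call
        if prover_accepts(theorem, proof):     # e.g. a Lean proof check
            dataset.append((theorem, proof))
    return dataset
```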
- Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability [58.582216812183496]
Language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims.
Current LMs generate incorrect or nonsensical content, and are difficult to edit and bring up to date.
We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate.
arXiv Detail & Related papers (2024-01-16T18:58:37Z)
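One DCT round, as we read the summary: expand seed statements into implications, keep a jointly consistent subset, and fine-tune on it. All `lm.*` methods below are hypothetical stand-ins for prompted LLM calls.

```python
from itertools import chain

def deductive_closure_round(lm, seed_texts: list[str]) -> None:
    """Fine-tune the LM on a consistent deductive closure of its own text."""
    implications = list(chain.from_iterable(
        lm.generate_implications(s) for s in seed_texts))
    pool = seed_texts + implications
    consistent = lm.most_probable_consistent_subset(pool)  # prune contradictions
    lm.finetune(consistent)
```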
- Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search [7.822427053078387]
The Generation-Augmented Retrieval (GAR) framework generates exemplar code snippets to augment queries.
We propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization.
Code Style Similarity is the first metric tailored to quantify stylistic similarities in code.
arXiv Detail & Related papers (2024-01-09T12:12:50Z)
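The summary does not define Code Style Similarity; as a stand-in illustration only, a token-level sequence similarity captures the flavor (surface style rather than semantics):

```python
import difflib
import io
import tokenize

def code_tokens(code: str) -> list[str]:
    """Lex Python source into a token-string sequence, dropping layout."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT}
    return [t.string
            for t in tokenize.generate_tokens(io.StringIO(code).readline)
            if t.type not in skip]

def style_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two snippets' token sequences.

    A hypothetical proxy, not the paper's Code Style Similarity definition.
    """
    return difflib.SequenceMatcher(None, code_tokens(a), code_tokens(b)).ratio()
```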
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of responses generated by large language models (LLMs).
We construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
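HyDE's trick fits in a few lines: embed an LLM-written hypothetical answer document instead of the query, then retrieve by ordinary vector similarity. `llm_generate` and `embed` below are hypothetical stand-ins for a generator and an encoder such as Contriever.

```python
import numpy as np

def hyde_search(query: str, llm_generate, embed,
                doc_vectors: np.ndarray, docs: list[str],
                top_k: int = 5) -> list[str]:
    """Dense retrieval where the query vector comes from a fake document."""
    hypothetical = llm_generate(f"Write a passage that answers: {query}")
    q = embed(hypothetical)                 # query-side vector
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1)
                              * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-sims)[:top_k]]
```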
- WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on TRUE benchmark on average.
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
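WeCheck's aggregation step can be pictured as turning several noisy metrics into one soft label per example before training the final checker. A deliberately plain sketch; WeCheck itself infers per-metric reliabilities with a weak-supervision label model rather than averaging.

```python
import numpy as np

def soft_labels(weak_scores: np.ndarray) -> np.ndarray:
    """weak_scores: (n_examples, n_metrics) factuality scores in [0, 1].

    Returns one soft target per example for fine-tuning the checker.
    """
    return weak_scores.mean(axis=1)
```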
- Converge to the Truth: Factual Error Correction via Iterative Constrained Editing [30.740281040892086]
We propose VENCE, a novel method for factual error correction (FEC) with minimal edits.
VENCE formulates the FEC problem as iterative sampling editing actions with respect to a target density function.
Experiments on a public dataset show that VENCE improves the well-adopted SARI metric by 5.3 (or a relative improvement of 11.8%) over the previous best distantly-supervised methods.
arXiv Detail & Related papers (2022-11-22T10:03:13Z)
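VENCE's "iterative sampling of editing actions with respect to a target density" suggests a Metropolis-style loop; the sketch below is that reading, with `log_density` (factuality plus a minimal-edit penalty) and `propose_edit` as hypothetical stand-ins.

```python
import math
import random

def iterative_correction(claim: str, log_density, propose_edit,
                         steps: int = 200, temp: float = 1.0) -> str:
    """Metropolis sampling over local edits toward a corrected claim."""
    current, cur_lp = claim, log_density(claim)
    for _ in range(steps):
        cand = propose_edit(current)               # one local edit action
        cand_lp = log_density(cand)
        # Accept uphill moves always, downhill moves with MH probability.
        if math.log(random.random() + 1e-12) < (cand_lp - cur_lp) / temp:
            current, cur_lp = cand, cand_lp
    return current
```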