Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set
- URL: http://arxiv.org/abs/2602.04910v1
- Date: Wed, 04 Feb 2026 01:04:56 GMT
- Title: Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set
- Authors: Nongyu Di, Tianyu Chen, Shan Lu, Shuai Lu, Yeyun Gong, Peng Cheng, Jacob R. Lorch, Yuan Yao, Xiaoxing Ma
- Abstract summary: We present VeruSyn, a data synthesis pipeline for Verus, a verification tool for system software written in Rust. We synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoffs.
- Score: 40.85677634306877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.
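For readers unfamiliar with Verus, each program VeruSyn synthesizes pairs ordinary Rust code with a formal specification and a proof that the code meets it. The sketch below is illustrative only (not drawn from the paper's dataset): real Verus code uses the `verus!` macro and `ensures` clauses from the Verus toolchain to prove the postcondition statically, so here the spec is checked at runtime with a plain-Rust assertion instead.

```rust
// Plain-Rust sketch of a Verus-style verified function.
// In actual Verus this would be written inside the verus! macro as
//   fn max_u64(a: u64, b: u64) -> (r: u64)
//       ensures r >= a, r >= b, r == a || r == b
// with the postcondition discharged statically by the verifier.
fn max_u64(a: u64, b: u64) -> u64 {
    let r = if a >= b { a } else { b };
    // Postcondition (proved at compile time in Verus; asserted here).
    assert!(r >= a && r >= b && (r == a || r == b));
    r
}

fn main() {
    println!("{}", max_u64(3, 7)); // prints 7
}
```

The point of the dataset is that a model learns to emit the spec and proof annotations alongside the code, letting the verifier certify correctness instead of relying on testing.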
Related papers
- Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis [19.056892622506826]
Reward modeling is crucial for aligning large language models with human preferences. Existing textual data synthesis methods are computationally expensive. We propose a novel framework for synthesizing preference data directly in the language model's latent embedding space.
arXiv Detail & Related papers (2025-09-30T10:48:50Z) - Iceberg: Enhancing HLS Modeling with Synthetic Data [61.48659845413156]
Iceberg is a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak-label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels.
arXiv Detail & Related papers (2025-07-14T05:48:09Z) - Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification [9.332807762710127]
We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs. The framework focuses on improving the internal logic of synthesized problems. A cross-lingual static code analysis pipeline filters invalid samples to ensure data quality.
arXiv Detail & Related papers (2025-05-29T07:14:43Z) - Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
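The unit-test-guided filtering UnitCoder describes can be sketched generically. The loop below is my own illustration, not the paper's pipeline: model-generated candidates are represented as closures, and only those that pass every model-generated test survive.

```rust
// Illustrative sketch of unit-test-guided candidate filtering
// (not UnitCoder's actual implementation).
type Candidate = Box<dyn Fn(u64, u64) -> u64>;

// Keep only the candidate implementations that pass every
// (input_a, input_b, expected_output) unit test.
fn filter_by_tests(candidates: Vec<Candidate>, tests: &[(u64, u64, u64)]) -> Vec<Candidate> {
    candidates
        .into_iter()
        .filter(|f| tests.iter().all(|&(a, b, want)| f(a, b) == want))
        .collect()
}

fn main() {
    // Two candidate "max" implementations: one correct, one buggy.
    let candidates: Vec<Candidate> = vec![
        Box::new(|a, b| if a > b { a } else { b }), // correct
        Box::new(|a, b| a + b),                     // buggy
    ];
    let tests = [(3, 7, 7), (9, 2, 9)];
    let survivors = filter_by_tests(candidates, &tests);
    println!("{}", survivors.len()); // prints 1
}
```

In the real pipeline the candidates and tests are generated text that gets compiled and executed, but the filtering logic is the same: tests act as an automatic quality gate over synthesized code.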
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - Training Language Models on Synthetic Edit Sequences Improves Code Synthesis [33.13471417703669]
Language models (LMs) autoregressively synthesize programs in a single pass. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. We develop a synthetic data generation algorithm called LintSeq to fill this gap.
arXiv Detail & Related papers (2024-10-03T17:57:22Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Too Big to Fail? Active Few-Shot Learning Guided Logic Synthesis [18.961915757370466]
We propose Bulls-Eye, which fine-tunes a pre-trained model on past synthesis data to accurately predict the quality of a synthesis recipe for an unseen netlist.
This approach achieves a 2x-10x run-time improvement and better quality-of-result (QoR) than state-of-the-art machine learning approaches.
arXiv Detail & Related papers (2022-04-05T17:18:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or their summaries and is not responsible for any consequences of their use.