Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing
- URL: http://arxiv.org/abs/2507.15889v1
- Date: Sun, 20 Jul 2025 02:10:46 GMT
- Title: Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing
- Authors: Noah van der Vleuten
- Abstract summary: We introduce a bootstrapping algorithm for program synthesis that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. We find that there are issues with the example test cases in the training portion of the APPS dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models for program synthesis are usually trained and evaluated on programming competition datasets (MBPP, APPS). However, these datasets are limited in size and quality, while these language models are extremely data-hungry. Additionally, the program synthesis process of these models is misaligned with how humans write code. While humans iteratively develop code with the help of a compiler, most program synthesis models currently produce code in one go. To solve these issues, we introduce a bootstrapping algorithm for program synthesis that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. Compared to other work, our bootstrapped model performs on par with fine-tuned models that are 68% larger. Notably, bootstrapping with repairing also improves non-repairing performance compared to regular bootstrapping during inference. However, on our models, repairing during inference is likely inferior to simply sampling the same number of solutions. Furthermore, we find issues with the example test cases in the training portion of the APPS dataset; these findings are valuable to the community, as many repairing and reinforcement learning methods rely on those test cases.
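As a rough illustration of the generate, test, and repair bootstrapping loop described in the abstract, the sketch below outlines one plausible implementation. The helpers generate, run_tests, and fine_tune, as well as the problem fields, are hypothetical placeholders; the paper's actual prompting, filtering, and training details are not specified here and may differ.

```python
# Minimal sketch of bootstrapping with a repair step, assuming hypothetical
# helpers: generate(model, prompt) -> code, run_tests(code, tests) -> error
# message or None, and fine_tune(model, examples) -> model.
def bootstrap_with_repair(model, problems, rounds=3):
    for _ in range(rounds):
        examples = []
        for problem in problems:
            code = generate(model, problem.prompt)
            error = run_tests(code, problem.example_tests)
            if error is None:
                # Correct on the first attempt: keep as a synthesis example.
                examples.append((problem.prompt, code))
                continue
            # Failed attempt: ask the model to repair, conditioning on the
            # faulty program and the compiler/test feedback.
            repair_prompt = f"{problem.prompt}\n{code}\n# Error:\n{error}\n# Fixed:"
            fixed = generate(model, repair_prompt)
            if run_tests(fixed, problem.example_tests) is None:
                # Keep the repair trace so the model also learns to repair.
                examples.append((repair_prompt, fixed))
        model = fine_tune(model, examples)  # train only on verified samples
    return model
```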
Related papers
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training [54.581599828392854]
We propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of large language models.
The training method simply introduces some noise at the input for the model to learn the denoising task.
Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x without compromising model performance.
arXiv Detail & Related papers (2024-06-25T09:25:39Z)
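The MSN summary above only states that noise is added to the input so the model learns a denoising task. The snippet below is a minimal sketch of one plausible noise scheme (random token replacement); it is an assumption for illustration, not the paper's actual corruption procedure.

```python
import random

def make_some_noise(tokens, vocab, noise_rate=0.15, seed=0):
    """Randomly replace a fraction of input tokens with random vocabulary
    tokens; the clean sequence serves as the denoising target."""
    rng = random.Random(seed)
    noisy = [rng.choice(vocab) if rng.random() < noise_rate else tok
             for tok in tokens]
    return noisy, list(tokens)  # (noisy input, target to reconstruct)

# Example: build one (input, target) training pair.
vocab = ["def", "return", "x", "y", "+", "(", ")", ":"]
noisy, target = make_some_noise(["def", "f", "(", "x", ")", ":"], vocab)
```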
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- Catwalk: A Unified Language Model Evaluation Framework for Many Datasets [50.75378592254184]
Catwalk provides a unified interface to a broad range of existing NLP datasets and models.
Catwalk substantially lowers the barriers to conducting controlled experiments at scale.
arXiv Detail & Related papers (2023-12-15T23:11:45Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
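The static-evaluation entry above quantifies errors by parsing completions into Abstract Syntax Trees. The snippet below is a minimal illustration using Python's standard ast module; it only catches syntax errors and is not the paper's full framework or error taxonomy.

```python
import ast

def static_syntax_check(code: str):
    """Return None if the completion parses, otherwise a short error report.
    Only syntax errors are caught here; other static error categories would
    need further AST analysis or a linter."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as err:
        return f"SyntaxError at line {err.lineno}: {err.msg}"

# Example: a truncated completion that a static check would flag.
print(static_syntax_check("def add(a, b):\n    return a +"))
```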
- Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome".
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
arXiv Detail & Related papers (2023-04-20T16:12:05Z)
- Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering [2.3826139428423576]
Sequence-to-sequence models have been used to transform erroneous programs into correct ones when trained with a large enough dataset.
Some recent studies have demonstrated strong empirical evidence that code review can further improve program repair.
We investigate whether this inherent knowledge of programming languages (PL) and natural language (NL) can be utilized to improve automated program repair.
arXiv Detail & Related papers (2023-04-16T17:29:51Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
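The iterative-consensus entry above composes pre-trained models as generators and scorers in a closed loop. The sketch below shows that idea in outline; generate and the scorer functions are hypothetical placeholders, and the paper's actual optimization procedure may differ.

```python
def iterative_consensus(generate, scorers, prompt, rounds=3, n_candidates=4):
    """Closed-loop sketch: a generator proposes candidates, an ensemble of
    scorers ranks them, and the best candidate is fed back as context."""
    best, context = None, prompt
    for _ in range(rounds):
        candidates = [generate(context) for _ in range(n_candidates)]
        # Ensemble feedback: average the scores from all scorers.
        best = max(candidates,
                   key=lambda c: sum(s(prompt, c) for s in scorers) / len(scorers))
        context = prompt + "\n" + best  # condition the next round on the winner
    return best
```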
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
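The CodeRL entry above mentions a sampling strategy at inference time. The sketch below illustrates one plausible reading of that idea, in which sampled programs are filtered by example unit tests and otherwise ranked by a learned critic; generate, critic_score, and run_unit_tests are hypothetical placeholders, not CodeRL's actual interface.

```python
def critic_guided_sampling(generate, critic_score, run_unit_tests,
                           prompt, tests, n_samples=8):
    """Simplified sketch: sample several programs, keep any that pass the
    example tests, otherwise return the candidate the critic scores highest
    (for instance as a seed for regeneration)."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    passing = [c for c in candidates if run_unit_tests(c, tests)]
    if passing:
        return passing[0]
    # No candidate passes: fall back to the critic's top-ranked program.
    return max(candidates, key=lambda c: critic_score(prompt, c))
```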
- Program Synthesis with Large Language Models [40.41120807053989]
We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
arXiv Detail & Related papers (2021-08-16T03:57:30Z)
- Patching as Translation: the Data and the Metaphor [18.22949296398319]
We show that "software patching is like language translation"
We show how a more principled approach to model design, based on our empirical findings and general knowledge of software development, can lead to better solutions.
We implement such models ourselves as "proof-of-concept" tools and empirically confirm that they behave in a fundamentally different, more effective way than the studied translation-based architectures.
arXiv Detail & Related papers (2020-08-24T21:05:27Z)