Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
- URL: http://arxiv.org/abs/2410.02749v2
- Date: Tue, 15 Oct 2024 03:42:01 GMT
- Title: Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
- Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus,
- Abstract summary: Large language models (LLMs) autoresourcedly synthesize programs in a single pass.
We develop a synthetic data generation algorithm called LintSeq to generate high-quality code edit data.
We show that edit sequence finetuned models produce more diverse programs than baselines synthesis.
- Score: 33.13471417703669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
Related papers
- Benchmarking Large Language Models with Integer Sequence Generation Tasks [1.3108652488669736]
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Sequences (OEIS)
Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences.
arXiv Detail & Related papers (2024-11-07T02:05:43Z) - Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [54.51932175059004]
We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models.
The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
arXiv Detail & Related papers (2024-07-29T20:42:59Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome"
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
arXiv Detail & Related papers (2023-04-20T16:12:05Z) - NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual
Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.