Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
- URL: http://arxiv.org/abs/2410.02749v3
- Date: Tue, 11 Feb 2025 18:59:47 GMT
- Title: Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
- Authors: Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
- Abstract summary: Language models (LMs) autoregressively synthesize programs in a single pass. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. We develop a synthetic data generation algorithm called LintSeq to fill this gap.
- Score: 33.13471417703669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software engineers mainly write code by editing existing programs. In contrast, language models (LMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of sequential edit data. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors programs into sequences of synthetic edits by using a linter to procedurally sample across interdependent lines of source code. Synthetic edits sampled with LintSeq reflect the syntax and semantics of their programming language. To test the algorithm, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we fine-tune a series of smaller LMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset. We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs. Finally, we also pretrain our own tiny LMs for code understanding. We show that fine-tuning these models to synthesize code edit-by-edit results in strong performance on HumanEval and MBPP(+) compared to existing code language models of similar scale such as CodeT5+, AlphaCode, and Codex.
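To make the recipe above concrete, here is a minimal, hypothetical sketch of an edit-sequence sampler in the spirit of LintSeq: it repeatedly deletes blocks of lines in a backward pass, keeping only intermediate programs that a "linter" accepts, and then replays the resulting states forward as diffs. This is not the authors' implementation; the names `lint_ok` and `sample_edit_sequence` are invented, and `ast.parse` is a stand-in for a real linter.

```python
import ast
import difflib
import random


def lint_ok(source: str) -> bool:
    """Stand-in 'linter': accept a program iff it still parses.
    (LintSeq uses a real linter; ast.parse is an assumption for brevity.)"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def sample_edit_sequence(program: str, rng=random, max_tries: int = 100):
    """Sample one synthetic edit sequence for `program`.

    Backward pass: repeatedly delete a random block of lines, keeping only
    states the 'linter' accepts. Forward pass: reverse the states and emit
    unified diffs, so each diff re-inserts code edit-by-edit.
    """
    states = [program.splitlines()]
    lines = list(states[0])
    while lines:
        for _ in range(max_tries):
            i = rng.randrange(len(lines))
            j = rng.randint(i + 1, len(lines))
            candidate = lines[:i] + lines[j:]  # drop the block lines[i:j]
            if not candidate or lint_ok("\n".join(candidate)):
                lines = candidate
                break
        else:  # no deletable block found within max_tries; stop early
            break
        states.append(list(lines))

    # Forward pass: from the smallest reached state back to the full program.
    states.reverse()
    diffs = []
    for prev, curr in zip(states, states[1:]):
        diffs.append("\n".join(difflib.unified_diff(prev, curr, lineterm="")))
    return diffs


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"
    for k, d in enumerate(sample_edit_sequence(src, random.Random(0))):
        print(f"--- edit {k} ---\n{d}\n")
```

In this framing, a fine-tuning example would map the original instruction to the concatenated diff sequence rather than to the final program in a single pass.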
Related papers
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data.
It comprises question-solution-test triplets that are systematically validated via a self-verification procedure.
This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic and scalable pipeline that leverages model-generated unit tests to guide and validate code generation, synthesizing high-quality code data from pre-training corpora (a minimal sketch of this test-then-filter idea appears after this list).
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- Benchmarking Large Language Models with Integer Sequence Generation Tasks [1.3108652488669736]
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS).
Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences.
arXiv Detail & Related papers (2024-11-07T02:05:43Z)
- LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning [23.987059076950622]
We propose a new data synthesis method called LogicPro, which synthesizes complex logical reasoning data in text format.
We synthesize data that is difficult, scalable, and effective, and that comes with gold-standard answers and high-quality reasoning processes.
Our approach achieves significant improvements across multiple models on the BBH (27 tasks), LogicBench, DROP, AR-LSAT, and GSM8K datasets.
arXiv Detail & Related papers (2024-09-19T17:30:45Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [54.51932175059004]
We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models.
The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- SynCode: LLM Generation with Grammar Augmentation [5.174301428591665]
SynCode is a novel framework for efficient and general decoding with LLMs.
It ensures soundness and completeness with respect to the CFG of a formal language, effectively retaining valid tokens while filtering out invalid ones.
Our experiments demonstrate that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-03-03T22:38:35Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation [61.50286000143233]
ChainCoder is a program synthesis language model that generates Python code progressively.
A tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples.
arXiv Detail & Related papers (2023-04-28T01:47:09Z)
- Fully Autonomous Programming with Large Language Models [0.9558392439655015]
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome".
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
arXiv Detail & Related papers (2023-04-20T16:12:05Z)
- NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)
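As an aside on the unit-test-guided synthesis idea in the UnitCoder entry above, the sketch below shows the simplest form of test-then-filter for synthetic code data. It is a hypothetical illustration, not UnitCoder's actual pipeline; the function names are invented, and a production version would sandbox execution and enforce timeouts.

```python
def passes_tests(candidate: str, tests: str) -> bool:
    """Return True iff `candidate` survives its unit `tests`.

    Real pipelines run this in a sandboxed subprocess with a timeout;
    plain exec() is used here only to keep the sketch short.
    """
    namespace = {}
    try:
        exec(candidate, namespace)   # define the candidate function(s)
        exec(tests, namespace)       # assert-style tests raise on failure
        return True
    except Exception:
        return False


def filter_synthetic_code(samples):
    """Keep only (code, tests) pairs whose code passes its own generated tests."""
    return [(code, tests) for code, tests in samples if passes_tests(code, tests)]


if __name__ == "__main__":
    good = "def square(x):\n    return x * x\n"
    bad = "def square(x):\n    return x + x\n"
    tests = "assert square(3) == 9\nassert square(2) == 4\n"
    print(len(filter_synthetic_code([(good, tests), (bad, tests)])))  # prints 1
```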