UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance
- URL: http://arxiv.org/abs/2502.11460v1
- Date: Mon, 17 Feb 2025 05:37:02 GMT
- Title: UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance
- Authors: Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
- Score: 65.01483640267885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from the pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31% and 28% to 40% and 39% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released (https://github.com).
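The abstract describes a loop in which model-generated unit tests serve as both the specification and the verifier, and candidate code is regenerated until it passes. The sketch below illustrates that idea only; the function names (generate_tests, generate_code, run_tests), the feedback format, and the retry budget are illustrative assumptions, not the paper's actual implementation, and the real pipeline's prompts, package-based retrieval, and sandboxing are omitted.

```python
"""Minimal sketch of unit-test-guided iterative code synthesis (assumed interfaces)."""
from typing import Callable, Optional


def run_tests(program: str, tests: str) -> Optional[str]:
    """Execute a candidate program together with its unit tests.

    Returns None on success, or the error message of the first failure.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)          # load the candidate implementation
        exec(tests, namespace)            # load the model-generated tests
        for name, obj in list(namespace.items()):
            if name.startswith("test_") and callable(obj):
                obj()                     # each test raises on failure
    except Exception as exc:              # any failure becomes feedback
        return f"{type(exc).__name__}: {exc}"
    return None


def synthesize(seed_snippet: str,
               generate_tests: Callable[[str], str],
               generate_code: Callable[[str, str, str], str],
               max_rounds: int = 3) -> Optional[str]:
    """Generate unit tests from a seed snippet, then refine code until the tests pass."""
    tests = generate_tests(seed_snippet)      # unit tests act as the specification
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate_code(seed_snippet, tests, feedback)
        error = run_tests(candidate, tests)
        if error is None:
            return candidate                  # verified program: keep for the dataset
        feedback = error                      # feed the failure back into the next round
    return None                               # discard snippets that never pass
```

Under this reading, only programs that eventually pass their generated tests are retained, which is how a corpus of verifiable training examples (500K+ programs in the paper's setting) can be accumulated at scale.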
Related papers
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z) - KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data.
It comprises question-solution-test triplets that are systematically validated via a self-verification procedure.
This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Evaluating and Aligning CodeLLMs on Human Preference [42.26173776584043]
We present a rigorous human-curated benchmark, CodeArena, to emulate the complexity and diversity of real-world coding tasks. We also propose a diverse synthetic instruction corpus, SynCode-Instruct, to verify the effectiveness of large-scale synthetic instruction fine-tuning. The results reveal performance differences between execution-based benchmarks and CodeArena.
arXiv Detail & Related papers (2024-12-06T17:40:38Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs. We propose a Case2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z) - CodeCoT: Tackling Code Syntax Errors in CoT Reasoning for Code Generation [6.139760107605468]
Chain-of-thought (CoT) has emerged as a groundbreaking tool in NLP, notably for its efficacy in complex reasoning tasks.
We present Code Chain-of-Thought (CodeCoT) that integrates CoT with a self-examination process for code generation.
arXiv Detail & Related papers (2023-08-17T04:58:51Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Better Language Models of Code through Self-Improvement [18.75015225501755]
We propose a simple data augmentation framework for pre-trained language models for code (PLMCs).
Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to generate pseudo data, which is then used as training data for the next step.
The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks.
arXiv Detail & Related papers (2023-04-02T10:59:19Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)