Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
- URL: http://arxiv.org/abs/2505.23177v1
- Date: Thu, 29 May 2025 07:14:43 GMT
- Title: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
- Authors: Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui Kang, Yaolong Wang, Kai Gao, Lei Qiao
- Abstract summary: We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs. The framework focuses on improving the internal logic of synthesized problems. A cross-lingual static code analysis pipeline filters invalid samples to ensure data quality.
- Score: 9.332807762710127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph and used to reconstruct the problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters out invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieve performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both the unfiltered version and the version filtered via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
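The abstract's third stage, static verification, is the most mechanical of the three and is easy to illustrate. Below is a minimal sketch of what such a filter could look like for the Python slice of the data, assuming each sample is a question-answer dict; the function and key names are illustrative, not the authors' API, and a genuinely cross-lingual pipeline would dispatch to per-language analyzers (compilers, linters) rather than relying on Python's `ast` module alone.

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Return True if the snippet parses as valid Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def filter_valid_samples(samples: list[dict]) -> list[dict]:
    """Keep only question-answer pairs whose answer code parses cleanly.
    The 'answer' key is an assumption about the sample schema."""
    return [s for s in samples if is_syntactically_valid(s["answer"])]

if __name__ == "__main__":
    samples = [
        {"question": "Sum two numbers.",
         "answer": "def add(a, b):\n    return a + b"},
        {"question": "Broken sample.",
         "answer": "def add(a, b) return a + b"},  # missing colon: filtered out
    ]
    print(len(filter_valid_samples(samples)))  # -> 1
```

The paper's pipeline presumably checks deeper static properties than syntax alone; this sketch only shows where such a gate sits in the data flow.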
Related papers
- OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique [59.18475981916166]
We introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions). In this work, we employ a two-stage supervised fine-tuning strategy: the first stage focuses on fine-tuning for code generation, while the second stage jointly trains models for both code generation and critique. Notably, integrating our code generation and critique models leads to significant improvements in competitive coding performance.
arXiv Detail & Related papers (2025-07-11T23:35:54Z)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
- Robust Learning of Diverse Code Edits [10.565439872488328]
Software engineering activities frequently involve edits to existing code. Code language models (LMs) lack the ability to handle diverse types of code-edit requirements. We propose a novel synthetic data generation pipeline and an adaptation algorithm.
arXiv Detail & Related papers (2025-03-05T16:39:04Z)
- KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. The result is a scalable approach for synthesizing high-quality code data from pre-training corpora. (A sketch of this test-based filtering primitive appears after this list.)
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- Training Language Models on Synthetic Edit Sequences Improves Code Synthesis [33.13471417703669]
Language models (LMs) autoregressively synthesize programs in a single pass. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. We develop a synthetic data generation algorithm called LintSeq to fill this gap. (A simplified sketch of edit-sequence data appears after this list.)
arXiv Detail & Related papers (2024-10-03T17:57:22Z)
- LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning [23.987059076950622]
We propose a new data synthesis method called LogicPro, which synthesizes complex logical reasoning data in text format. The data we synthesize is difficult, scalable, effective, and comes with gold-standard answers and high-quality reasoning processes. Our approach achieves significant improvements across multiple models on BBH (27 subtasks), LogicBench, DROP, AR-LSAT, and GSM8K.
arXiv Detail & Related papers (2024-09-19T17:30:45Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph. It is updated by decoding in the context of an auto-encoder. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
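Several of the papers above (UnitCoder, KodCode, Sol-Ver) rely on a common filtering primitive: execute unit tests against candidate code and keep only samples that pass, as referenced from the UnitCoder entry. The sketch below is an illustrative Python version of that primitive, not any single paper's implementation; `passes_unit_test` is an invented name, and a production pipeline would use real sandboxing (containers, resource and import limits) rather than a bare subprocess.

```python
import multiprocessing

def _run(code: str, test: str, queue) -> None:
    """Execute a candidate solution and its unit test in one namespace."""
    try:
        namespace = {}
        exec(code, namespace)  # define the candidate function(s)
        exec(test, namespace)  # raises AssertionError on failure
        queue.put(True)
    except Exception:
        queue.put(False)

def passes_unit_test(code: str, test: str, timeout: float = 5.0) -> bool:
    """Run the pair in a subprocess so hangs and crashes can be contained."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(code, test, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():          # infinite loop or timeout: reject the sample
        proc.terminate()
        proc.join()
        return False
    return queue.get() if not queue.empty() else False

if __name__ == "__main__":
    code = "def square(x):\n    return x * x"
    test = "assert square(3) == 9 and square(-2) == 4"
    print(passes_unit_test(code, test))  # -> True
```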
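The LintSeq entry above trains on synthetic edit sequences rather than whole programs; as flagged there, here is a deliberately simplified sketch of what such data can look like. The actual LintSeq algorithm samples linter-guided deletions and reverses them; this stand-in merely replays a program line by line and records each step as a unified diff, and `edit_sequence` is a hypothetical helper.

```python
import difflib

def edit_sequence(source: str) -> list[str]:
    """Rebuild a program one line at a time, recording each step as a
    unified diff; the resulting diffs mimic the shape of edit data."""
    lines = source.splitlines(keepends=True)
    edits, state = [], []
    for line in lines:
        new_state = state + [line]
        edits.append("".join(difflib.unified_diff(state, new_state)))
        state = new_state
    return edits

if __name__ == "__main__":
    program = "def add(a, b):\n    return a + b\n"
    for step in edit_sequence(program):
        print(step)
```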
This list is automatically generated from the titles and abstracts of the papers on this site.