Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
- URL: http://arxiv.org/abs/2407.21077v1
- Date: Mon, 29 Jul 2024 20:42:59 GMT
- Title: Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
- Authors: Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg,
- Abstract summary: We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models.
The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
- Score: 54.51932175059004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets poses challenges, particularly in expert-dependent tasks like coding, which can be cost-prohibitive. One approach to mitigate these challenges is synthesizing data using another LLM. In this paper, we introduce a scalable method for generating synthetic instructions to enhance the code generation capability of LLMs. The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds. Genetic-Instruct is designed for efficient scaling of the generation process. Fine-tuning multiple coding LLMs with the synthetic samples demonstrates a significant improvement in their code generation accuracy compared to the baselines.
Related papers
- Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs [1.8838588087156363]
This paper investigates the fine-tuning of code-generating Large Language Models (LLMs)
We enhance the training data for the reward model with the help of symbolic execution techniques.
Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL.
arXiv Detail & Related papers (2025-04-21T16:29:07Z) - Empowering AI to Generate Better AI Code: Guided Generation of Deep Learning Projects with LLMs [4.616570111453259]
Large language models (LLMs) struggle with generating entire deep learning projects.
We propose a novel planning-guided code generation method, DLCodeGen, tailored for generating deep learning projects.
arXiv Detail & Related papers (2025-04-21T13:09:25Z) - OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories.
evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.
We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design [55.54477725000291]
HiVeGen is a hierarchical Verilog generation framework that decomposes generation tasks into hierarchical submodules.
automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse.
Real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.
arXiv Detail & Related papers (2024-12-06T19:37:53Z) - Training LLMs for Generating IEC 61131-3 Structured Text with Online Feedback [0.0]
This paper proposes a novel approach to training large language models (LLMs) that emphasizes improving the quality of learning data.
The framework proves highly suitable for industrial automation applications and outperforms state-of-the-art models.
arXiv Detail & Related papers (2024-10-29T15:54:09Z) - Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [51.20656279478878]
MATRIX is a multi-agent simulator that automatically generates diverse text-based scenarios.
We introduce MATRIX-Gen for controllable and highly realistic data synthesis.
On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model.
arXiv Detail & Related papers (2024-10-18T08:01:39Z) - EPiC: Cost-effective Search-based Prompt Engineering of LLMs for Code Generation [8.009881267479189]
Large Language Models (LLMs) have seen increasing use in various software development tasks, especially in code generation.
We propose an alternative approach named Evolutionary Prompt Engineering for Code (EPiC) to evolve the original prompts toward better ones that produce high-quality code.
Our evaluation against state-of-the-art (SOTA) LLM-based code generation models shows that EPiC outperforms all the baselines in terms of cost-effectiveness.
arXiv Detail & Related papers (2024-08-20T21:15:36Z) - Case2Code: Learning Inductive Reasoning with Synthetic Data [105.89741089673575]
We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
We first evaluate representative LLMs on the synthesized Case2Code task and demonstrate that the Case-to-code induction is challenging for LLMs.
Experimental results show that such induction training benefits not only in distribution Case2Code performance but also enhances various coding abilities of trained LLMs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z) - AICoderEval: Improving AI Domain Code Generation of Large Language Models [10.060988050644076]
We open-source the AICoderEval dataset to facilitate research in this area.
We propose CoderGen, an agent-based framework, to help LLMs generate codes related to real-world tasks.
We train a more powerful task-specific code generation model, named AICoder, which is refined on llama-3 based on AICoderEval.
arXiv Detail & Related papers (2024-06-07T07:45:38Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation [35.10931307279044]
This paper proposes Self-Evaluation Decoding, SED, a decoding method for enhancing model generation.
It integrates speculation and evaluation steps into the decoding process, allowing LLMs to make more careful decisions.
arXiv Detail & Related papers (2024-05-26T12:43:18Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.