InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
- URL: http://arxiv.org/abs/2407.05700v2
- Date: Mon, 16 Dec 2024 03:08:42 GMT
- Title: InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
- Authors: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen,
- Abstract summary: This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset.
We propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset.
- Score: 43.7550233177368
- License:
- Abstract: Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g. CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - SelfCodeAlign: Self-Alignment for Code Generation [15.23960029671979]
SelfCodeAlign is the first fully transparent and permissive pipeline for self-aligning code language models (LLMs)
It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks.
It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment.
Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller.
arXiv Detail & Related papers (2024-10-31T17:55:13Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Magicoder: Empowering Code Generation with OSS-Instruct [14.414411313794911]
We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code.
Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct.
Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks.
arXiv Detail & Related papers (2023-12-04T18:50:35Z) - InstructCoder: Instruction Tuning Large Language Models for Code Editing [26.160498475809266]
We explore the use of Large Language Models (LLMs) to edit code based on user instructions.
InstructCoder is the first instruction-tuning dataset designed to adapt LLMs for general-purpose code editing.
Our findings reveal that open-source LLMs fine-tuned on InstructCoder can significantly enhance the accuracy of code edits.
arXiv Detail & Related papers (2023-10-31T10:15:35Z) - Tuna: Instruction Tuning using Feedback from Large Language Models [74.04950416204551]
We propose finetuning an instruction-tuned large language model using our novel textitprobabilistic ranking and textitcontextual ranking approaches.
Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM.
On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs.
arXiv Detail & Related papers (2023-10-20T09:55:06Z) - WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.