InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
- URL: http://arxiv.org/abs/2407.05700v2
- Date: Mon, 16 Dec 2024 03:08:42 GMT
- Title: InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct
- Authors: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen,
- Abstract summary: This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset.<n>We propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset.
- Score: 43.7550233177368
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We make two observations: (1) A code snippet can serve as the response to different instructions. (2) Instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g. CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models.
Related papers
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z) - OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - SelfCodeAlign: Self-Alignment for Code Generation [15.23960029671979]
SelfCodeAlign is the first fully transparent and permissive pipeline for self-aligning code language models (LLMs)
It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks.
It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment.
Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller.
arXiv Detail & Related papers (2024-10-31T17:55:13Z) - Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z) - Text-like Encoding of Collaborative Information in Large Language Models for Recommendation [58.87865271693269]
We introduce BinLLM, a novel method to seamlessly integrate collaborative information with Large Language Models for Recommendation (LLMRec)
BinLLM converts collaborative embeddings from external models into binary sequences.
BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths.
arXiv Detail & Related papers (2024-06-05T12:45:25Z) - AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z) - Mutation-based Consistency Testing for Evaluating the Code Understanding
Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages.
We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions.
We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs.
We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - Magicoder: Empowering Code Generation with OSS-Instruct [14.414411313794911]
We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code.
Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct.
Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks.
arXiv Detail & Related papers (2023-12-04T18:50:35Z) - Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z) - WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.