Related papers: Magicoder: Empowering Code Generation with OSS-Instruct

Magicoder: Empowering Code Generation with OSS-Instruct

URL: http://arxiv.org/abs/2312.02120v2
Date: Fri, 7 Jun 2024 02:50:56 GMT
Title: Magicoder: Empowering Code Generation with OSS-Instruct
Authors: Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang,
Abstract summary: We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks.
Score: 14.414411313794911
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate diverse instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs through the wealth of open-source references for the production of more realistic and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1 ). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code using abundant open-source references.

Related papers

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [59.60208063956459]
Large Language Models (LLMs) require high quality instruction data for effective alignment. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high quality coding instructions.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct [43.7550233177368]
This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions of code responses from its own training dataset.
arXiv Detail & Related papers (2024-07-08T08:00:05Z)
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning [22.44573249705913]
We present WaveCoder, a series of Code LLMs trained with Widespread And Versatile Enhanced instruction data. To enable the models to tackle complex code-related tasks, we propose a method to stably generate diverse, high-quality instruction data from open source code dataset. Our experiments demonstrate that WaveCoder models significantly outperform other open-source models in terms of the generalization ability across different code-related tasks.
arXiv Detail & Related papers (2023-12-20T09:02:29Z)
WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning. Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework. It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates. It then uses fine-grained cross-encoders for finer ranking.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.