A Systematic Evaluation of Large Language Models of Code
- URL: http://arxiv.org/abs/2202.13169v2
- Date: Tue, 1 Mar 2022 19:13:06 GMT
- Title: A Systematic Evaluation of Large Language Models of Code
- Authors: Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn
- Abstract summary: Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
- Score: 88.34057460577957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LMs) of code have recently shown tremendous promise in
completing code and synthesizing code from natural language descriptions.
However, the current state-of-the-art code LMs (e.g., Codex (Chen et al.,
2021)) are not publicly available, leaving many questions about their model and
data design decisions. We aim to fill in some of these blanks through a
systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo,
GPT-NeoX-20B, and CodeParrot, across various programming languages. Although
Codex itself is not open-source, we find that existing open-source models do
achieve close results in some programming languages, even though they are
targeted mainly at natural language modeling. We further identify an important
missing piece
in the form of a large open-source model trained exclusively on a multi-lingual
corpus of code. We release a new model, PolyCoder, with 2.7B parameters based
on the GPT-2 architecture, which was trained on 249GB of code across 12
programming languages on a single machine. In the C programming language,
PolyCoder outperforms all models including Codex. Our trained models are
open-source and publicly available at https://github.com/VHellendoorn/Code-LMs,
which enables future research and application in this area.
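Since the trained models are public, the paper's core measurement, language-model perplexity on held-out code in each language, is easy to reproduce in outline. Below is a minimal sketch using the HuggingFace transformers API; the model identifier is an assumption (a community-hosted copy of the released checkpoint), and the held-out corpora are reduced to one toy snippet per language.

    # Minimal sketch: per-language perplexity of a causal code LM.
    # Assumption: "NinedayWang/PolyCoder-2.7B" is a community-hosted copy of
    # the released checkpoint; substitute any causal LM checkpoint you have.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "NinedayWang/PolyCoder-2.7B"  # assumed identifier
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    snippets = {  # one toy file per language; a real run uses held-out corpora
        "C": "int add(int a, int b) { return a + b; }",
        "Python": "def add(a, b):\n    return a + b\n",
    }

    for lang, code in snippets.items():
        ids = tokenizer(code, return_tensors="pt").input_ids
        with torch.no_grad():
            # labels=input_ids makes the model return mean token cross-entropy
            loss = model(ids, labels=ids).loss
        print(f"{lang}: perplexity = {math.exp(loss.item()):.2f}")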
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the training objectives for the pretrained GNN expert.
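As a loose illustration of the idea rather than the paper's actual pipeline, the sketch below derives a crude data-flow view of a Python snippet from its AST and prepends it to a generation prompt; the edge format and prompt layout are hypothetical simplifications.

    # Loose sketch of a graph-augmented prompt: extract crude def-use edges
    # from a Python snippet's AST and prepend them to the generation prompt.
    # This illustrates the idea only; it is not CodeGRAG's actual pipeline.
    import ast

    def dataflow_edges(source: str):
        """Collect rough def-use edges as (variable, def_line, use_line)."""
        tree = ast.parse(source)
        defs, edges = {}, []
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    defs[node.id] = node.lineno
                elif isinstance(node.ctx, ast.Load) and node.id in defs:
                    edges.append((node.id, defs[node.id], node.lineno))
        return edges

    code = "x = 1\ny = x + 2\nprint(y)\n"
    graph = "\n".join(f"{v}: line {d} -> line {u}"
                      for v, d, u in dataflow_edges(code))
    prompt = f"# Data-flow edges:\n{graph}\n# Code:\n{code}# Continue:\n"
    print(prompt)  # feed this graph-augmented prompt to a code LLM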
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [42.517055368627226]
We introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens.
Our evaluations demonstrate that DeepSeek-Coder achieves state-of-the-art performance among open-source code models across multiple benchmarks.
DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.
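A minimal completion sketch with one of the smaller checkpoints, assuming it is hosted under the deepseek-ai organization with the identifier shown:

    # Minimal completion sketch with a DeepSeek-Coder base model.
    # Assumption: the 1.3B checkpoint is hosted under the identifier below.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed identifier
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    prompt = "def quicksort(arr):\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))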
arXiv Detail & Related papers (2024-01-25T14:17:53Z)
- StarCoder: may the source be with you! [79.93915935620798]
The BigCode community introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length.
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories.
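StarCoderBase's training corpus is itself public, so one can inspect it directly. A hedged sketch of streaming a few files from one language subset follows; the dataset identifier, per-language directory layout, and field names are assumptions based on the BigCode release, and access may require accepting the dataset's terms.

    # Sketch: stream a small sample of one language subset of The Stack.
    # Assumptions: dataset id "bigcode/the-stack", per-language directories
    # such as data/python, and "content" / "max_stars_repo_name" fields.
    from datasets import load_dataset

    ds = load_dataset("bigcode/the-stack", data_dir="data/python",
                      split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["max_stars_repo_name"], len(example["content"]), "chars")
        if i == 4:  # look at five files and stop
            break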
arXiv Detail & Related papers (2023-05-09T08:16:42Z)
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [50.008474888951525]
We introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation.
CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages.
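HumanEval-X, like HumanEval, scores functional correctness with pass@k. The standard unbiased estimator (introduced with Codex and generally adopted by HumanEval-style benchmarks) computes 1 - C(n-c, k)/C(n, k) from n samples per problem with c passing the tests; a numerically stable sketch:

    # Unbiased pass@k estimator used by HumanEval-style benchmarks:
    # pass@k = 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """n = samples generated per problem, c = samples passing the tests."""
        if n - c < k:
            return 1.0  # every size-k subset contains a correct sample
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 200 samples per problem, 13 passing -> estimated pass@10
    print(round(pass_at_k(200, 13, 10), 4))

Generating many samples per problem (n much larger than k) and estimating pass@k this way gives a lower-variance estimate than literally drawing only k samples.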
arXiv Detail & Related papers (2023-03-30T17:34:01Z)
- SantaCoder: don't reach for the stars! [27.050410834027705]
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark.
Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E.
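Infilling here means fill-in-the-middle (FIM) prompting: the file is split into a prefix and a suffix, and the model generates the middle after sentinel tokens. A minimal sketch follows; the model identifier and the dash-spelled sentinel tokens are assumptions based on the SantaCoder release and should be checked against the tokenizer's special-token map.

    # Sketch of fill-in-the-middle (FIM) prompting for an infilling model.
    # Assumptions: model id "bigcode/santacoder" and sentinels spelled
    # <fim-prefix>/<fim-suffix>/<fim-middle>; confirm both against
    # tokenizer.special_tokens_map before relying on them.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "bigcode/santacoder"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID,
                                                 trust_remote_code=True)

    prefix = "def fib(n):\n    "
    suffix = "\n    return fib(n - 1) + fib(n - 2)\n"
    prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"

    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    middle = tokenizer.decode(out[0][inputs.input_ids.shape[1]:],
                              skip_special_tokens=True)
    print(prefix + middle + suffix)  # model fills in the base-case line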
arXiv Detail & Related papers (2023-01-09T10:52:35Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover more than 10 programming languages and allow us to assess the performance of code generation models in a multi-lingual fashion.
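Benchmarks in this family are execution-based: a completion counts as correct only if the target language's test harness runs it successfully. A stripped-down, Python-only sketch of that check follows (real harnesses sandbox execution and support many language toolchains).

    # Stripped-down execution check in the spirit of MBXP-style harnesses:
    # run a candidate completion against unit tests in a subprocess with a
    # timeout. Never exec untrusted model output outside a sandbox.
    import subprocess
    import sys
    import tempfile

    def passes_tests(candidate: str, tests: str, timeout: float = 5.0) -> bool:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate + "\n" + tests)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    candidate = "def add(a, b):\n    return a + b\n"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(passes_tests(candidate, tests))  # True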
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages [1.6312827172331896]
We propose MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation.
We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder.
The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance.
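The benchmark is built by compiling each Python problem (signature, docstring, tests) into equivalent prompts for the other languages. A toy illustration of that compilation step, with hypothetical templates for two target languages:

    # Toy illustration of MultiPL-E's idea: render one problem specification
    # into per-language prompts. Templates are hypothetical simplifications.
    TEMPLATES = {
        "python": 'def {name}({params}):\n    """{doc}"""\n',
        "cpp": "// {doc}\nint {name}({params}) {{\n",
    }
    PARAM_STYLE = {"python": "{p}", "cpp": "int {p}"}

    def render(lang: str, name: str, params: list, doc: str) -> str:
        rendered = ", ".join(PARAM_STYLE[lang].format(p=p) for p in params)
        return TEMPLATES[lang].format(name=name, params=rendered, doc=doc)

    spec = dict(name="add", params=["a", "b"], doc="Return the sum of a and b.")
    for lang in TEMPLATES:
        print(f"--- {lang} ---")
        print(render(lang, **spec))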
arXiv Detail & Related papers (2022-08-17T11:16:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.