A Scalable and Extensible Approach to Benchmarking NL2Code for 18
Programming Languages
- URL: http://arxiv.org/abs/2208.08227v2
- Date: Fri, 19 Aug 2022 01:12:49 GMT
- Title: A Scalable and Extensible Approach to Benchmarking NL2Code for 18
Programming Languages
- Authors: Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna
Phipps-Costin, Donald Pinckney, Ming Ho Yee, Yangtian Zi, Carolyn Jane
Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, Abhinav Jangda
- Abstract summary: We propose MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation.
We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder.
The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance.
- Score: 1.6312827172331896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have demonstrated the ability to condition on and
generate both natural language and programming language text. Such models open
up the possibility of multi-language code generation: could code generation
models generalize knowledge from one language to another? Although contemporary
code generation models can generate semantically correct Python code, little is
known about their abilities with other languages. We facilitate the exploration
of this topic by proposing MultiPL-E, the first multi-language parallel
benchmark for natural-language-to-code generation.
MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18
more programming languages, encompassing a range of programming paradigms and
popularity. We evaluate two state-of-the-art code generation models on
MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches
and even exceeds its performance on Python. The range of programming languages
represented in MultiPL-E allows us to explore the impact of language frequency
and language features on model performance. Finally, the MultiPL-E approach of
compiling code generation benchmarks to new programming languages is both
scalable and extensible. We describe a general approach for easily adding
support for new benchmarks and languages to MultiPL-E.
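Benchmarks in the HumanEval family, including MultiPL-E, typically report functional correctness as pass@k. A minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021) is shown below; the function name and signature are illustrative, not taken from the MultiPL-E codebase:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Given n generated samples for a problem, of which c pass the unit
    tests, estimate the probability that at least one of k randomly
    drawn samples passes: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # samples must include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 samples, 1 correct -> pass@1 is 0.5
print(pass_at_k(2, 1, 1))
```

Computing the estimator from counts rather than averaging over explicit subsets keeps the metric cheap and exact, which matters when evaluating 18 languages across many problems.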
Related papers
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation
  We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
  CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation.
  arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
  This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
  We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
  Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
  The resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
  arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Can Large Language Models Write Parallel Code?
  Large language models are becoming an increasingly popular tool for software development.
  In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
  arXiv Detail & Related papers (2024-01-23T08:25:12Z)
- ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
  Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
  Recent studies have demonstrated the effectiveness of generative pre-training for computer programs, yet these models are almost always English-centric.
  We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs.
  arXiv Detail & Related papers (2022-12-13T17:21:44Z)
- Multi-lingual Evaluation of Code Generation Models
  We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
  These datasets cover over 10 programming languages, allowing us to assess the performance of code generation models in a multi-lingual fashion.
  arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
  We benchmark code generation from natural language commands extending beyond English.
  We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
  While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
  arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- A Systematic Evaluation of Large Language Models of Code
  Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
  The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions open.
  Although Codex is not open-source, we find that existing open-source models achieve close results in some programming languages.
  We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, trained on 249GB of code across 12 programming languages on a single machine.
  arXiv Detail & Related papers (2022-02-26T15:53:55Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models
  Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
  However, studies of LMs' factual representation ability have almost invariably been performed on English.
  We create a benchmark of cloze-style probes for 23 typologically diverse languages.
  arXiv Detail & Related papers (2020-10-13T05:29:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.