Execution-Based Evaluation for Open-Domain Code Generation
- URL: http://arxiv.org/abs/2212.10481v2
- Date: Fri, 19 May 2023 14:27:46 GMT
- Title: Execution-Based Evaluation for Open-Domain Code Generation
- Authors: Zhiruo Wang, Shuyan Zhou, Daniel Fried, Graham Neubig
- Abstract summary: ODEX is the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset.
ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution.
ODEX supports intents in four natural languages: English, Spanish, Japanese, and Russian.
- Score: 81.96731162394445
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: To extend the scope of coding queries to more realistic settings, we propose
ODEX, the first Open-Domain EXecution-based natural language (NL) to Python
code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse
libraries, along with 1,707 human-written test cases for execution. Our NL-Code
pairs are harvested from StackOverflow forums to encourage natural and
practical coding queries. Moreover, ODEX supports intents in four natural
languages: English, Spanish, Japanese, and Russian. ODEX unveils intriguing
behavioral differences among top-performing code language models (LMs). While
CODEX achieves better overall results, CODEGEN improves effectively via scaling
-- CODEGEN 6.1B performs comparably with CODEX 12B. Both models show
substantial gaps between open and closed domains, but CODEGEN gaps tend to
decrease with model size while CODEX gaps increase. We release ODEX to
facilitate research into open-domain problems for the code generation
community.
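Execution-based evaluation of this kind runs each generated program against the paired human-written test cases and counts a sample as correct only if every assertion passes. Below is a minimal sketch of such a harness; the function names and the example NL-Code pair are illustrative assumptions, not the released ODEX format or code.

```python
import multiprocessing


def _run(candidate_code: str, test_code: str, queue) -> None:
    # Execute the candidate solution, then the human-written asserts,
    # in a fresh namespace; report pass/fail through the queue.
    env: dict = {}
    try:
        exec(candidate_code, env)
        exec(test_code, env)
        queue.put(True)
    except BaseException:
        queue.put(False)


def passes_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    # A subprocess with a timeout keeps infinite loops or crashes
    # in generated code from hanging the harness.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(candidate_code, test_code, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return False
    return queue.get() if not queue.empty() else False


if __name__ == "__main__":
    # Hypothetical NL-Code pair in the spirit of ODEX (not from the dataset):
    # intent "return the first line of a file", plus one test case.
    candidate = (
        "def first_line(path):\n"
        "    with open(path) as f:\n"
        "        return f.readline().rstrip('\\n')\n"
    )
    tests = (
        "import tempfile, os\n"
        "fd, p = tempfile.mkstemp()\n"
        "os.write(fd, b'a\\nb\\n'); os.close(fd)\n"
        "assert first_line(p) == 'a'\n"
        "os.remove(p)\n"
    )
    print(passes_tests(candidate, tests))  # True when every assertion passes
```

A production harness would additionally sandbox file-system and network access, since open-domain library calls can have side effects.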
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code [20.60634057560564]
We propose EXPO, a framework for EXtending Pre-trained language models for lOng-range code.
EXPO incorporates two innovative memory mechanisms: Bridge Memory and Hint Memory.
We validate the effectiveness of EXPO on five popular pre-trained language models, including UniXcoder.
arXiv Detail & Related papers (2024-05-18T09:06:41Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of each code block from its control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert (a toy graph-extraction sketch follows this entry).
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
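As a rough illustration of the graph-view idea (not CodeGRAG's implementation, which uses richer control- and data-flow graphs plus a pretrained GNN expert), crude data-flow edges can be read off a Python snippet with the standard ast module; all names here are illustrative.

```python
import ast


def dataflow_edges(source: str) -> list[tuple[str, str]]:
    # Very crude data flow: emit an edge (x, y) whenever variable x
    # is read in a statement that assigns variable y.
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            reads = [n.id for n in ast.walk(node.value)
                     if isinstance(n, ast.Name)]
            writes = [t.id for t in node.targets
                      if isinstance(t, ast.Name)]
            edges += [(r, w) for r in reads for w in writes]
    return edges


snippet = "a = 1\nb = a + 2\nc = a * b\n"
print(dataflow_edges(snippet))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Such edges, serialized or encoded by a GNN, give a retrieval or prompting signal that plain token sequences lack.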
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [42.517055368627226]
We introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens.
Our evaluations demonstrate that DeepSeek-Coder achieves state-of-the-art performance among open-source code models across multiple benchmarks.
DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.
arXiv Detail & Related papers (2024-01-25T14:17:53Z)
- XGen-7B Technical Report [138.71625147048377]
XGen is a series of 7B-parameter models trained on up to 1.5T tokens with sequence lengths of up to 8K.
We open-source our models for both research advancements and commercial applications.
arXiv Detail & Related papers (2023-09-07T02:20:03Z)
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [50.008474888951525]
We introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation.
CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages.
arXiv Detail & Related papers (2023-03-30T17:34:01Z)
- Large Language Models Meet NL2Code: A Survey [19.606985859571083]
We present a comprehensive survey of 27 existing large language models for NL2Code.
The key factors behind the success of large language models for NL2Code are "Large Size, Premium Data, Expert Tuning".
arXiv Detail & Related papers (2022-12-19T12:55:32Z)
- A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
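Evaluations like these typically report execution-based pass@k. A standard unbiased estimator, in the numerically stable form popularized by the Codex paper (Chen et al., 2021), can be sketched as:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of pass@k: the probability that at least one of
    # k samples drawn from n generations (c of which pass all tests) is
    # correct. Numerically stable form of 1 - C(n-c, k) / C(n, k):
    #   1 - prod_{i = n-c+1}^{n} (1 - k / i)
    if n - c < k:
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


# Example: 200 samples per problem, 37 passing, reporting pass@10.
print(round(pass_at_k(200, 37, 10), 4))
```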