xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code
Understanding, Generation, Translation and Retrieval
- URL: http://arxiv.org/abs/2303.03004v4
- Date: Mon, 6 Nov 2023 07:16:58 GMT
- Authors: Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi
Wang, Md Rizwan Parvez, Shafiq Joty
- Abstract summary: We introduce xCodeEval, the largest executable multilingual multitask benchmark to date.
It features a total of 7 tasks involving code understanding, generation, translation and retrieval.
xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, pre-trained large language models (LLMs) have shown impressive
abilities in generating code from natural language descriptions, repairing
buggy code, translating code between languages, and retrieving relevant code
segments. However, the evaluation of these models has often been performed in a
scattered way on only one or two specific tasks, in a few languages, at a
partial granularity (e.g., function) level, and in many cases without proper
training data. Even more concerning is that in most cases the evaluation of
generated code has been done in terms of mere lexical overlap with a reference
code rather than actual execution. We introduce xCodeEval, the largest
executable multilingual multitask benchmark to date, consisting of 25M
document-level coding examples (16.5B tokens) from about 7.5K unique
problems covering up to 11 programming languages with execution-level
parallelism. It features a total of 7 tasks involving code understanding,
generation, translation and retrieval. xCodeEval adopts an execution-based
evaluation and offers a multilingual code execution engine, ExecEval, that
supports unit-test-based execution in all 11 languages. To address the
challenge of balancing the distributions of text-code samples over multiple
attributes in validation/test sets, we propose a novel data splitting and
data selection schema based on the geometric mean and graph-theoretic
principles. Our experiments with OpenAI's LLMs (zero-shot) and open LLMs
(zero-shot and fine-tuned) on the tasks and languages demonstrate xCodeEval
to be quite challenging given the current state of language models.
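To make the execution-based evaluation concrete, here is a minimal sketch of unit-test-driven scoring together with the standard unbiased pass@k estimator (Chen et al., 2021). It illustrates the general idea only and is not the actual ExecEval interface: the run_candidate helper, the test format, and the time limit are assumptions.

```python
import math
import subprocess
import tempfile
from pathlib import Path

# Hypothetical helper: run one candidate program against one unit test.
# A real multilingual engine dispatches to per-language compilers/runtimes;
# here we only illustrate the idea for a Python candidate.
def run_candidate(source_code: str, stdin_data: str, expected_stdout: str,
                  time_limit_s: float = 2.0) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.py"
        src.write_text(source_code)
        try:
            proc = subprocess.run(
                ["python3", str(src)],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()

def passes_all_tests(source_code: str, tests) -> bool:
    # tests: iterable of (stdin, expected_stdout) pairs
    return all(run_candidate(source_code, i, o) for i, o in tests)

# Unbiased pass@k estimator: given n samples per problem, of which c pass all
# unit tests, estimate the probability that at least one of k randomly drawn
# samples passes.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

A problem counts as solved only if every unit test passes within the resource limits; lexical overlap with a reference solution plays no role in the score.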
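The split-balancing idea can be sketched in a similar spirit: score a candidate validation/test set by the geometric mean of how well it covers each attribute (for example, programming language and problem tag), so that a split neglecting any single attribute value is driven towards a score of zero. This is only a rough illustration of the stated geometric-mean principle; the attribute names, the coverage measure, and the selection step are assumptions, not the paper's exact graph-theoretic algorithm.

```python
from collections import Counter
from math import prod

def balance_score(samples, attribute_values):
    """Geometric mean of per-attribute balance for a candidate split.

    `attribute_values` maps each attribute (e.g. "lang") to the full set of
    values that should be represented (e.g. all 11 languages). For each
    attribute we take the worst-covered value's share relative to a uniform
    share; the geometric mean pushes the score towards 0 if any attribute
    value is missing or badly under-represented.
    """
    per_attribute = []
    for attr, values in attribute_values.items():
        counts = Counter(s[attr] for s in samples)
        ideal = len(samples) / len(values)               # uniform share per value
        worst = min(counts.get(v, 0) for v in values)    # least-covered value
        per_attribute.append(worst / ideal)
    return prod(per_attribute) ** (1.0 / len(per_attribute))

# Hypothetical usage: choose the candidate split with the best balance.
vocab = {"lang": {"C++", "Python"}, "tag": {"dp", "graphs"}}
candidates = [
    [{"lang": "C++", "tag": "dp"}, {"lang": "Python", "tag": "graphs"}],
    [{"lang": "C++", "tag": "dp"}, {"lang": "C++", "tag": "dp"}],
]
best = max(candidates, key=lambda s: balance_score(s, vocab))
# The first candidate wins: it covers both languages and both tags evenly.
```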
Related papers
- mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation [28.531581489405745]
mHumanEval is an extended benchmark supporting prompts in over 200 natural languages.
We provide expert human translations for 15 diverse natural languages (NLs)
We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs.
arXiv Detail & Related papers (2024-10-19T08:44:26Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Can Large Language Models Write Parallel Code? [0.5317767988097261]
Large language models are increasingly becoming a popular tool for software development.
In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
arXiv Detail & Related papers (2024-01-23T08:25:12Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
- GLUECoS: An Evaluation Benchmark for Code-Switched NLP [17.066725832825423]
We present an evaluation benchmark, GLUECoS, for code-switched languages.
We present results on several NLP tasks in English-Hindi and English-Spanish.
We fine-tune multilingual models on artificially generated code-switched data.
arXiv Detail & Related papers (2020-04-26T13:28:34Z)