XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
- URL: http://arxiv.org/abs/2206.08474v1
- Date: Thu, 16 Jun 2022 22:49:39 GMT
- Title: XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
- Authors: Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu
Tipirneni, Chandan K. Reddy
- Abstract summary: This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
- Score: 9.673614921946932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in machine learning have significantly improved the
understanding of source code data and achieved good performance on a number of
downstream tasks. Open source repositories like GitHub enable this process with
rich unlabeled code data. However, the lack of high quality labeled data has
largely hindered the progress of several code related tasks, such as program
translation, summarization, synthesis, and code search. This paper introduces
XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for
cross-lingual code intelligence. Our dataset contains fine-grained parallel
data from 8 languages (7 commonly used programming languages and English), and
supports 10 cross-lingual code tasks. To the best of our knowledge, it is the
largest parallel dataset for source code both in terms of size and the number
of languages. We also provide the performance of several state-of-the-art
baseline models for each task. We believe this new dataset can be a valuable
asset for the research community and facilitate the development and validation
of new methods for cross-lingual code intelligence.
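The abstract describes fine-grained snippets aligned across seven programming languages and English, from which many task-specific pair sets (translation, summarization, synthesis, search) can be derived. Below is a minimal sketch of that idea, not the official XLCoST format or loader: the record layout, field names, and sample snippet are all hypothetical.

```python
# Minimal sketch (hypothetical, not the official XLCoST schema): one aligned,
# snippet-level record spanning several languages plus its English description,
# and how pairwise examples for translation and summarization could be derived.
from itertools import permutations

record = {
    "snippet_id": "example-001",
    "english": "Return the maximum of two integers.",
    "code": {
        "python": "def max_of_two(a, b):\n    return a if a > b else b",
        "java": "static int maxOfTwo(int a, int b) { return a > b ? a : b; }",
        "cpp": "int maxOfTwo(int a, int b) { return a > b ? a : b; }",
    },
}

def translation_pairs(rec):
    """Yield (source_lang, target_lang, source_code, target_code) pairs."""
    for src, tgt in permutations(rec["code"].keys(), 2):
        yield src, tgt, rec["code"][src], rec["code"][tgt]

def summarization_pairs(rec):
    """Yield (lang, code, english_summary) pairs for code summarization."""
    for lang, code in rec["code"].items():
        yield lang, code, rec["english"]

if __name__ == "__main__":
    print(sum(1 for _ in translation_pairs(record)))    # 6 directed pairs from 3 languages
    print(sum(1 for _ in summarization_pairs(record)))  # 3 code-to-text pairs
```

With N aligned languages per record, a single aligned corpus yields N*(N-1) directed translation pairs plus N code-to-text pairs, which is how one fine-grained parallel dataset can support many cross-lingual tasks.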
Related papers
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- CodeShell Technical Report [23.741490720927068]
We present CodeShell-Base, a foundation model with 8K context length, showcasing exceptional proficiency in code comprehension.
We have curated 100 billion tokens of high-quality pre-training data from GitHub.
Benefiting from the high-quality data, CodeShell-Base outperforms CodeLlama on HumanEval after training on just 500 billion tokens (5 epochs).
arXiv Detail & Related papers (2024-03-23T07:29:41Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z)
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation [5.2510537676167335]
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages.
Our evaluations show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet.
arXiv Detail & Related papers (2023-05-09T09:35:03Z)
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval [32.60391966381949]
We introduce xCodeEval, the largest executable multilingual multitask benchmark to date.
It features a total of 7 tasks involving code understanding, generation, translation and retrieval.
xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval.
arXiv Detail & Related papers (2023-03-06T10:08:51Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics; a minimal sketch of the general retrieve-then-prompt idea appears after this list.
We evaluate our approach on the code completion task in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- CoDesc: A Large Code-Description Parallel Dataset [4.828053113572208]
We present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions.
With extensive analysis, we identify and remove prevailing noise patterns from the dataset.
We show that the dataset helps improve code search by up to 22% and achieves a new state of the art in code summarization.
arXiv Detail & Related papers (2021-05-29T05:40:08Z)
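The ReACC entry above describes retrieval-augmented code completion. The sketch below illustrates only the generic retrieve-then-prompt idea under simplifying assumptions: a toy in-memory corpus, a crude lexical tokenizer, and Jaccard overlap as the similarity measure. It is not ReACC's actual retriever or model, and the corpus and prompt format are hypothetical.

```python
# Minimal sketch of retrieval-augmented code completion: retrieve the corpus
# snippet most similar to the unfinished code, then prepend it to the prompt
# that a completion model would receive. Everything here is a simplified
# placeholder, not the ReACC implementation.
import re

CORPUS = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def write_text(path, text):\n    with open(path, 'w') as f:\n        f.write(text)",
]

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers and numbers only."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

def retrieve(context: str, corpus=CORPUS) -> str:
    """Return the corpus snippet with the highest Jaccard overlap with the context."""
    ctx = tokens(context)
    def score(snippet: str) -> float:
        snip = tokens(snippet)
        return len(ctx & snip) / max(1, len(ctx | snip))
    return max(corpus, key=score)

def build_prompt(context: str) -> str:
    """Concatenate the retrieved snippet and the unfinished code for a completion model."""
    retrieved = retrieve(context)
    return f"# Retrieved similar code:\n{retrieved}\n\n# Code to complete:\n{context}"

if __name__ == "__main__":
    unfinished = "def load_config(path):\n    import json\n    with open(path) as f:"
    print(build_prompt(unfinished))
```

In a full system, the retrieved snippet would come from a large lexical and/or dense index and the concatenated prompt would be fed to a code language model; the point here is only the interface between retrieval and completion.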
This list is automatically generated from the titles and abstracts of the papers on this site.