XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
- URL: http://arxiv.org/abs/2206.08474v1
- Date: Thu, 16 Jun 2022 22:49:39 GMT
- Title: XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
- Authors: Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu
Tipirneni, Chandan K. Reddy
- Abstract summary: This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence.
Our dataset contains fine-grained parallel data from 8 languages, and supports 10 cross-lingual code tasks.
- Score: 9.673614921946932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in machine learning have significantly improved the
understanding of source code data and achieved good performance on a number of
downstream tasks. Open source repositories like GitHub enable this process with
rich unlabeled code data. However, the lack of high quality labeled data has
largely hindered the progress of several code related tasks, such as program
translation, summarization, synthesis, and code search. This paper introduces
XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for
cross-lingual code intelligence. Our dataset contains fine-grained parallel
data from 8 languages (7 commonly used programming languages and English), and
supports 10 cross-lingual code tasks. To the best of our knowledge, it is the
largest parallel dataset for source code both in terms of size and the number
of languages. We also provide the performance of several state-of-the-art
baseline models for each task. We believe this new dataset can be a valuable
asset for the research community and facilitate the development and validation
of new methods for cross-lingual code intelligence.
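The abstract describes fine-grained snippets aligned across seven programming languages and English, from which many task-specific pair sets (translation, summarization, synthesis, search) can be derived. Below is a minimal sketch of that idea, not the official XLCoST format or loader: the record layout, field names, and sample snippet are all hypothetical.

```python
# Minimal sketch (hypothetical, not the official XLCoST schema): one aligned,
# snippet-level record spanning several languages plus its English description,
# and how pairwise examples for translation and summarization could be derived.
from itertools import permutations

record = {
    "snippet_id": "example-001",
    "english": "Return the maximum of two integers.",
    "code": {
        "python": "def max_of_two(a, b):\n    return a if a > b else b",
        "java": "static int maxOfTwo(int a, int b) { return a > b ? a : b; }",
        "cpp": "int maxOfTwo(int a, int b) { return a > b ? a : b; }",
    },
}

def translation_pairs(rec):
    """Yield (source_lang, target_lang, source_code, target_code) pairs."""
    for src, tgt in permutations(rec["code"].keys(), 2):
        yield src, tgt, rec["code"][src], rec["code"][tgt]

def summarization_pairs(rec):
    """Yield (lang, code, english_summary) pairs for code summarization."""
    for lang, code in rec["code"].items():
        yield lang, code, rec["english"]

if __name__ == "__main__":
    print(sum(1 for _ in translation_pairs(record)))    # 6 directed pairs from 3 languages
    print(sum(1 for _ in summarization_pairs(record)))  # 3 code-to-text pairs
```

With N aligned languages per record, a single aligned corpus yields N*(N-1) directed translation pairs plus N code-to-text pairs, which is how one fine-grained parallel dataset can support many cross-lingual tasks.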
Related papers
- Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z)
- CodeShell Technical Report [23.741490720927068]
We present CodeShell-Base, a foundation model with 8K context length, showcasing exceptional proficiency in code comprehension.
We have curated 100 billion tokens of high-quality pre-training data from GitHub.
Benefiting from the high-quality data, CodeShell-Base outperforms CodeLlama on HumanEval after training on just 500 billion tokens (5 epochs).
arXiv Detail & Related papers (2024-03-23T07:29:41Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z)
- The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation [5.2510537676167335]
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages.
Our evaluations show that Code Large Language Models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet.
arXiv Detail & Related papers (2023-05-09T09:35:03Z)
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval [32.60391966381949]
We introduce xCodeEval, the largest executable multilingual multitask benchmark to date.
It features a total of 7 tasks involving code understanding, generation, translation and retrieval.
xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval.
arXiv Detail & Related papers (2023-03-06T10:08:51Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics; a minimal sketch of the general retrieve-then-prompt idea appears after this list.
We evaluate our approach on the code completion task in Python and Java, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- CoDesc: A Large Code-Description Parallel Dataset [4.828053113572208]
We present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions.
With extensive analysis, we identify and remove prevailing noise patterns from the dataset.
We show that the dataset helps improve code search by up to 22% and achieves a new state of the art in code summarization.
arXiv Detail & Related papers (2021-05-29T05:40:08Z)
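The ReACC entry above describes retrieval-augmented code completion. The sketch below illustrates only the generic retrieve-then-prompt idea under simplifying assumptions: a toy in-memory corpus, a crude lexical tokenizer, and Jaccard overlap as the similarity measure. It is not ReACC's actual retriever or model, and the corpus and prompt format are hypothetical.

```python
# Minimal sketch of retrieval-augmented code completion: retrieve the corpus
# snippet most similar to the unfinished code, then prepend it to the prompt
# that a completion model would receive. Everything here is a simplified
# placeholder, not the ReACC implementation.
import re

CORPUS = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def write_text(path, text):\n    with open(path, 'w') as f:\n        f.write(text)",
]

def tokens(code: str) -> set:
    """Crude lexical tokenization: identifiers and numbers only."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))

def retrieve(context: str, corpus=CORPUS) -> str:
    """Return the corpus snippet with the highest Jaccard overlap with the context."""
    ctx = tokens(context)
    def score(snippet: str) -> float:
        snip = tokens(snippet)
        return len(ctx & snip) / max(1, len(ctx | snip))
    return max(corpus, key=score)

def build_prompt(context: str) -> str:
    """Concatenate the retrieved snippet and the unfinished code for a completion model."""
    retrieved = retrieve(context)
    return f"# Retrieved similar code:\n{retrieved}\n\n# Code to complete:\n{context}"

if __name__ == "__main__":
    unfinished = "def load_config(path):\n    import json\n    with open(path) as f:"
    print(build_prompt(unfinished))
```

In a full system, the retrieved snippet would come from a large lexical and/or dense index and the concatenated prompt would be fed to a code language model; the point here is only the interface between retrieval and completion.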
This list is automatically generated from the titles and abstracts of the papers on this site.