CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- URL: http://arxiv.org/abs/2310.04951v2
- Date: Wed, 25 Oct 2023 01:40:49 GMT
- Title: CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- Authors: Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, Wen Wang
- Abstract summary: CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translations between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
- Score: 8.979765541978292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent code translation techniques exploit neural machine translation models
to translate source code from one programming language to another to meet
production compatibility requirements or to improve the efficiency of codebase maintenance. Most
existing code translation datasets only focus on a single pair of popular
programming languages. To advance research on code translation and meet diverse
requirements of real-world applications, we construct CodeTransOcean, a
large-scale comprehensive benchmark that supports the largest variety of
programming languages for code translation. CodeTransOcean consists of three
novel multilingual datasets, namely, MultilingualTrans supporting translations
between multiple popular programming languages, NicheTrans for translating
between niche programming languages and popular ones, and LLMTrans for
evaluating executability of translated code by large language models (LLMs).
CodeTransOcean also includes a novel cross-framework dataset, DLTrans, for
translating deep learning code across different frameworks. We develop
multilingual modeling approaches for code translation and demonstrate their
potential to improve translation quality for both low-resource and
high-resource language pairs and to boost training efficiency. We also
propose a novel evaluation metric, Debugging Success Rate@K, for program-level
code translation. Finally, we evaluate the LLM ChatGPT on our datasets
and investigate its potential for fuzzy execution predictions. We build
baselines for CodeTransOcean and analyze challenges of code translation for
guiding future research. The CodeTransOcean datasets and code are publicly
available at https://github.com/WeixiangYAN/CodeTransOcean.
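As a rough illustration of the proposed Debugging Success Rate@K metric, the sketch below counts a task as successful if the translated program executes correctly within K rounds of automated debugging. The helpers `translate`, `debug_with_feedback`, and `run_program` are hypothetical stand-ins, not the authors' released implementation.

```python
from typing import Callable, Dict, List, Tuple


def dsr_at_k(
    tasks: List[Dict],
    translate: Callable[[str], str],
    debug_with_feedback: Callable[[str, str], str],
    run_program: Callable[[str, Dict], Tuple[bool, str]],
    k: int,
) -> float:
    """Fraction of tasks whose translation runs correctly within k debugging rounds."""
    successes = 0
    for task in tasks:
        candidate = translate(task["source_code"])
        ok, error_message = run_program(candidate, task)
        rounds = 0
        while not ok and rounds < k:
            # Ask the model to repair the program, feeding back the error message.
            candidate = debug_with_feedback(candidate, error_message)
            ok, error_message = run_program(candidate, task)
            rounds += 1
        if ok:
            successes += 1
    return successes / max(len(tasks), 1)
```

With k = 0 this reduces to a plain execution success rate, which matches the kind of executability evaluation the LLMTrans subset targets.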
Related papers
- Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? [4.616570111453259]
Large language models (LLMs) exhibit state-of-the-art performance in various tasks but struggle with code translation.
We conduct a large-scale empirical study to examine the capabilities and limitations of LLMs on code translation tasks.
We propose two methods: (1) intermediary translation, which selects an intermediary language between the source and target languages; and (2) self-training, which fine-tunes LLMs on self-generated parallel data.
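A minimal sketch of the intermediary-translation idea, assuming a hypothetical `llm_translate(code, src_lang, tgt_lang)` wrapper around an LLM call; the paper's actual prompting and language-selection strategy may differ.

```python
def pivot_translate(code: str, src: str, tgt: str, pivot: str, llm_translate) -> str:
    """Translate src -> pivot -> tgt using a generic LLM translation callable."""
    intermediate = llm_translate(code, src_lang=src, tgt_lang=pivot)
    return llm_translate(intermediate, src_lang=pivot, tgt_lang=tgt)


# Hypothetical usage: route a low-resource pair through a popular language.
# rust_code = pivot_translate(fortran_code, "Fortran", "Rust", pivot="Python",
#                             llm_translate=my_llm_translate)
```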
arXiv Detail & Related papers (2024-10-13T12:20:12Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
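A hedged sketch of continued causal language modelling on paired source/IR data using Hugging Face Transformers; the checkpoint, file name, and field names ("source_code", "llvm_ir") are illustrative assumptions rather than the IRCoder training setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigcode/starcoderbase-1b"  # assumption: any causal Code-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Assumed JSONL with one record per program: its source code and its compiler IR.
raw = load_dataset("json", data_files="sltrans_like_pairs.jsonl", split="train")


def to_features(example):
    # Concatenate a program with its IR so both appear in one training context.
    text = example["source_code"] + "\n; --- IR ---\n" + example["llvm_ir"]
    return tokenizer(text, truncation=True, max_length=2048)


tokenized = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```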
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy.
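One hedged sketch of the multiple-references idea: sample extra candidate translations and keep those that behave correctly, so training no longer overfits to a single gold target. `sample_translations` and `passes_unit_tests` are hypothetical helpers, not the paper's API.

```python
def add_extra_references(example: dict, sample_translations, passes_unit_tests,
                         n_samples: int = 10) -> dict:
    """Append model-generated translations that pass the example's tests as
    additional references (assumed fields: source, references, tests)."""
    references = list(example["references"])  # starts from the single gold target
    for candidate in sample_translations(example["source"], n=n_samples):
        if passes_unit_tests(candidate, example["tests"]):
            references.append(candidate)
    example["references"] = references
    return example
```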
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- Program Translation via Code Distillation [20.668229308907495]
Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
arXiv Detail & Related papers (2023-10-17T04:59:15Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-source pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
Such a model can be considered bilingual in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
They allow us to assess the performance of code generation models in a multilingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
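A minimal sketch of building a noisy parallel corpus with a document similarity method, here TF-IDF cosine similarity via scikit-learn; the paper compares several similarity methods, and the threshold below is an arbitrary assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def noisy_pairs(java_files, python_files, threshold=0.5):
    """Pair each Java file with its most similar Python file above a threshold."""
    vectorizer = TfidfVectorizer(token_pattern=r"\w+")
    matrix = vectorizer.fit_transform(java_files + python_files)
    java_vecs = matrix[: len(java_files)]
    python_vecs = matrix[len(java_files):]
    similarities = cosine_similarity(java_vecs, python_vecs)
    pairs = []
    for i, row in enumerate(similarities):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```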
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
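A minimal sketch of a KL-divergence self-teaching loss in PyTorch, where a detached copy of the model's predictions on one view of an example provides soft pseudo-labels for another view (e.g., its translation); the function name and temperature are illustrative, not the FILTER implementation.

```python
import torch
import torch.nn.functional as F


def kl_self_teaching_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between detached soft pseudo-labels and the student's distribution."""
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```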
arXiv Detail & Related papers (2020-09-10T22:42:15Z)