CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- URL: http://arxiv.org/abs/2310.04951v2
- Date: Wed, 25 Oct 2023 01:40:49 GMT
- Title: CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation
- Authors: Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, Wen Wang
- Abstract summary: CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translations between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
- Score: 8.979765541978292
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent code translation techniques exploit neural machine translation models
to translate source code from one programming language to another to meet
production compatibility requirements or to improve the efficiency of codebase maintenance. Most
existing code translation datasets only focus on a single pair of popular
programming languages. To advance research on code translation and meet diverse
requirements of real-world applications, we construct CodeTransOcean, a
large-scale comprehensive benchmark that supports the largest variety of
programming languages for code translation. CodeTransOcean consists of three
novel multilingual datasets, namely, MultilingualTrans supporting translations
between multiple popular programming languages, NicheTrans for translating
between niche programming languages and popular ones, and LLMTrans for
evaluating executability of translated code by large language models (LLMs).
CodeTransOcean also includes a novel cross-framework dataset, DLTrans, for
translating deep learning code across different frameworks. We develop
multilingual modeling approaches for code translation and demonstrate their
potential to improve translation quality for both low-resource and
high-resource language pairs and to boost training efficiency. We also
propose a novel evaluation metric, Debugging Success Rate@K, for program-level
code translation. Finally, we evaluate the LLM ChatGPT on our datasets
and investigate its potential for fuzzy execution predictions. We build
baselines for CodeTransOcean and analyze challenges of code translation for
guiding future research. The CodeTransOcean datasets and code are publicly
available at https://github.com/WeixiangYAN/CodeTransOcean.
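As a rough illustration of the proposed Debugging Success Rate@K metric, the sketch below counts a task as successful if the translated program executes correctly within K rounds of automated debugging. The helpers `translate`, `debug_with_feedback`, and `run_program` are hypothetical stand-ins, not the authors' released implementation.

```python
from typing import Callable, Dict, List, Tuple


def dsr_at_k(
    tasks: List[Dict],
    translate: Callable[[str], str],
    debug_with_feedback: Callable[[str, str], str],
    run_program: Callable[[str, Dict], Tuple[bool, str]],
    k: int,
) -> float:
    """Fraction of tasks whose translation runs correctly within k debugging rounds."""
    successes = 0
    for task in tasks:
        candidate = translate(task["source_code"])
        ok, error_message = run_program(candidate, task)
        rounds = 0
        while not ok and rounds < k:
            # Ask the model to repair the program, feeding back the error message.
            candidate = debug_with_feedback(candidate, error_message)
            ok, error_message = run_program(candidate, task)
            rounds += 1
        if ok:
            successes += 1
    return successes / max(len(tasks), 1)
```

With k = 0 this reduces to a plain execution success rate, which matches the kind of executability evaluation the LLMTrans subset targets.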
Related papers
- Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? [4.616570111453259]
Large language models (LLMs) exhibit state-of-the-art performance in various tasks but struggle with code translation.
We conduct a large-scale empirical study to examine the capabilities and limitations of LLMs on code translation tasks.
We propose two methods: (1) intermediary translation, which selects an intermediary language between the source and target languages; and (2) self-training, which fine-tunes LLMs on self-generated parallel data.
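A minimal sketch of the intermediary-translation idea, assuming a hypothetical `llm_translate(code, src_lang, tgt_lang)` wrapper around an LLM call; the paper's actual prompting and language-selection strategy may differ.

```python
def pivot_translate(code: str, src: str, tgt: str, pivot: str, llm_translate) -> str:
    """Translate src -> pivot -> tgt using a generic LLM translation callable."""
    intermediate = llm_translate(code, src_lang=src, tgt_lang=pivot)
    return llm_translate(intermediate, src_lang=pivot, tgt_lang=tgt)


# Hypothetical usage: route a low-resource pair through a popular language.
# rust_code = pivot_translate(fortran_code, "Fortran", "Rust", pivot="Python",
#                             llm_translate=my_llm_translate)
```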
arXiv Detail & Related papers (2024-10-13T12:20:12Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
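A hedged sketch of continued causal language modelling on paired source/IR data using Hugging Face Transformers; the checkpoint, file name, and field names ("source_code", "llvm_ir") are illustrative assumptions rather than the IRCoder training setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigcode/starcoderbase-1b"  # assumption: any causal Code-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Assumed JSONL with one record per program: its source code and its compiler IR.
raw = load_dataset("json", data_files="sltrans_like_pairs.jsonl", split="train")


def to_features(example):
    # Concatenate a program with its IR so both appear in one training context.
    text = example["source_code"] + "\n; --- IR ---\n" + example["llvm_ir"]
    return tokenizer(text, truncation=True, max_length=2048)


tokenized = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued-pretrain", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```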
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy.
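One hedged sketch of the multiple-references idea: sample extra candidate translations and keep those that behave correctly, so training no longer overfits to a single gold target. `sample_translations` and `passes_unit_tests` are hypothetical helpers, not the paper's API.

```python
def add_extra_references(example: dict, sample_translations, passes_unit_tests,
                         n_samples: int = 10) -> dict:
    """Append model-generated translations that pass the example's tests as
    additional references (assumed fields: source, references, tests)."""
    references = list(example["references"])  # starts from the single gold target
    for candidate in sample_translations(example["source"], n=n_samples):
        if passes_unit_tests(candidate, example["tests"]):
            references.append(candidate)
    example["references"] = references
    return example
```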
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- Program Translation via Code Distillation [20.668229308907495]
Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
arXiv Detail & Related papers (2023-10-17T04:59:15Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-source pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
Such a model can be considered bilingual in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
They allow us to assess the performance of code generation models in a multilingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
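A minimal sketch of building a noisy parallel corpus with a document similarity method, here TF-IDF cosine similarity via scikit-learn; the paper compares several similarity methods, and the threshold below is an arbitrary assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def noisy_pairs(java_files, python_files, threshold=0.5):
    """Pair each Java file with its most similar Python file above a threshold."""
    vectorizer = TfidfVectorizer(token_pattern=r"\w+")
    matrix = vectorizer.fit_transform(java_files + python_files)
    java_vecs = matrix[: len(java_files)]
    python_vecs = matrix[len(java_files):]
    similarities = cosine_similarity(java_vecs, python_vecs)
    pairs = []
    for i, row in enumerate(similarities):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```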
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Improving Sign Language Translation with Monolingual Data by Sign Back-Translation [105.83166521438463]
We propose a sign back-translation (SignBT) approach, which incorporates massive spoken language texts into sign training.
With a text-to-gloss translation model, we first back-translate the monolingual text to its gloss sequence.
Then, the paired sign sequence is generated by splicing pieces from an estimated gloss-to-sign bank at the feature level.
arXiv Detail & Related papers (2021-05-26T08:49:30Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
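A minimal sketch of a KL-divergence self-teaching loss in PyTorch, where a detached copy of the model's predictions on one view of an example provides soft pseudo-labels for another view (e.g., its translation); the function name and temperature are illustrative, not the FILTER implementation.

```python
import torch
import torch.nn.functional as F


def kl_self_teaching_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between detached soft pseudo-labels and the student's distribution."""
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```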
arXiv Detail & Related papers (2020-09-10T22:42:15Z)