Data Augmentation for Code Translation with Comparable Corpora and Multiple References
- URL: http://arxiv.org/abs/2311.00317v2
- Date: Fri, 04 Oct 2024 04:16:21 GMT
- Title: Data Augmentation for Code Translation with Comparable Corpora and Multiple References
- Authors: Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose
- Abstract summary: We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy.
- Abstract: One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.
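To make the multi-reference augmentation concrete, the sketch below shows one way to sample extra candidate translations for a parallel example and keep only those that pass the target program's unit tests, so each source program ends up with several verified references. The `sample_translations` and `passes_unit_tests` callables are hypothetical placeholders (e.g., a sampling wrapper around a translation model and a test runner), not the authors' actual interfaces; their implementation is in the CMTrans repository linked above.

```python
from typing import Callable, Iterable, List

def augment_with_references(
    source_program: str,
    gold_target: str,
    sample_translations: Callable[[str, int], Iterable[str]],  # hypothetical: sampled model translations
    passes_unit_tests: Callable[[str], bool],                   # hypothetical: runs the target test suite
    num_samples: int = 10,
) -> List[str]:
    """Return the gold reference plus sampled translations that pass the
    unit tests, de-duplicated; the survivors serve as extra training
    references for the same source program."""
    references = [gold_target]
    for candidate in sample_translations(source_program, num_samples):
        if candidate not in references and passes_unit_tests(candidate):
            references.append(candidate)
    return references
```

Training then treats each (source, reference) pair as a separate example, which increases variation in target translations without requiring new human-written parallel data.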
Related papers
- CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming [15.391781573025787]
We introduce CodeRosetta, an encoder-decoder model designed specifically for translating between programming languages and their HPC extensions.
CodeRosetta is evaluated on C++ to parallel C++ translation tasks.
Our results show that CodeRosetta outperforms state-of-the-art baselines on these translation tasks.
arXiv Detail & Related papers (2024-10-27T17:34:07Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect code clones in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine
Translation of Lecture Transcripts [50.00305136008848]
We propose a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera.
For both English--Japanese and English--Chinese lecture translations, we extracted parallel corpora of approximately 50,000 lines and created development and test sets.
This study also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, cleaning noise in the mined data, and creating high-quality evaluation splits.
arXiv Detail & Related papers (2023-11-07T03:50:25Z) - Program Translation via Code Distillation [20.668229308907495]
Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
arXiv Detail & Related papers (2023-10-17T04:59:15Z) - CodeTransOcean: A Comprehensive Multilingual Benchmark for Code
Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z) - The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered a bilingual model in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z) - Original or Translated? On the Use of Parallel Data for Translation
Quality Estimation [81.27850245734015]
We demonstrate a significant gap between parallel data and real QE data.
Parallel data is collected indiscriminately, and translationese may occur on either the source or the target side.
We find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart.
arXiv Detail & Related papers (2022-12-20T14:06:45Z) - Using Document Similarity Methods to create Parallel Datasets for Code
Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
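As an illustration of the idea (not the paper's exact method), the sketch below pairs programs across two languages by TF-IDF similarity over identifier-like tokens; the actual similarity measures, filtering, and thresholds used in the paper may differ, and the 0.5 threshold here is an assumption.

```python
# Dependencies: scikit-learn (TfidfVectorizer, cosine_similarity).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_noisy_pairs(java_files, python_files, threshold=0.5):
    """Pair each Java file with its most similar Python file above a threshold.

    Returns (java_index, python_index, similarity) triples; the resulting
    dataset is noisy by construction, as in the paper's setting.
    """
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+")  # identifier-like tokens
    tfidf = vectorizer.fit_transform(java_files + python_files)
    sims = cosine_similarity(tfidf[:len(java_files)], tfidf[len(java_files):])
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```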
arXiv Detail & Related papers (2021-10-11T17:07:58Z) - Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
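A toy, first-order sketch of that bi-level loop is shown below: the back-translation model is rewarded when the pseudo-parallel batch it generates makes the forward model do better on held-out data. The paper differentiates through the forward model's update; the REINFORCE-style surrogate and the tiny per-token models here are simplifications for illustration only.

```python
import torch
import torch.nn as nn

VOCAB, DIM, LEN, BATCH = 32, 16, 5, 8  # deliberately tiny toy sizes

class ToySeq2Seq(nn.Module):
    """A tiny per-token 'translator' standing in for a real seq2seq model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)

    def logits(self, src):                 # (B, L) -> (B, L, V)
        return self.out(self.emb(src))

    def loss(self, src, tgt):
        return nn.functional.cross_entropy(
            self.logits(src).reshape(-1, VOCAB), tgt.reshape(-1))

    def sample(self, src):                 # one sampled output token per position
        return torch.distributions.Categorical(logits=self.logits(src)).sample()

    def log_prob(self, src, out):          # (B,) sequence log-probabilities
        return torch.distributions.Categorical(
            logits=self.logits(src)).log_prob(out).sum(-1)

fwd, bt = ToySeq2Seq(), ToySeq2Seq()       # forward and back-translation models
fwd_opt = torch.optim.SGD(fwd.parameters(), lr=0.1)
bt_opt = torch.optim.SGD(bt.parameters(), lr=0.1)

mono_tgt = torch.randint(0, VOCAB, (BATCH, LEN))  # monolingual target-side batch
val_src = torch.randint(0, VOCAB, (BATCH, LEN))   # small parallel validation set
val_tgt = torch.randint(0, VOCAB, (BATCH, LEN))

for step in range(3):
    # 1) Back-translate target sentences into pseudo source sentences.
    with torch.no_grad():
        pseudo_src = bt.sample(mono_tgt)
    # 2) Train the forward model one step on the pseudo-parallel pair.
    fwd_loss = fwd.loss(pseudo_src, mono_tgt)
    fwd_opt.zero_grad(); fwd_loss.backward(); fwd_opt.step()
    # 3) Score the updated forward model on held-out parallel data.
    with torch.no_grad():
        val_loss = fwd.loss(val_src, val_tgt)
    # 4) Reward back-translations that led to a lower validation loss
    #    (a REINFORCE-style surrogate for the paper's meta-gradient).
    reward = -val_loss
    bt_loss = -(reward * bt.log_prob(mono_tgt, pseudo_src)).mean()
    bt_opt.zero_grad(); bt_loss.backward(); bt_opt.step()
```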
arXiv Detail & Related papers (2021-02-15T20:58:32Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z) - Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.