Leveraging Automated Unit Tests for Unsupervised Code Translation
- URL: http://arxiv.org/abs/2110.06773v1
- Date: Wed, 13 Oct 2021 15:08:43 GMT
- Title: Leveraging Automated Unit Tests for Unsupervised Code Translation
- Authors: Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel
Synnaeve, Guillaume Lample
- Abstract summary: We propose to leverage an automated unit-testing system to filter out invalid translations.
We find that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated.
In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively.
- Score: 34.84910520660154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With little to no parallel data available for programming languages,
unsupervised methods are well-suited to source code translation. However, the
majority of unsupervised machine translation approaches rely on
back-translation, a method developed in the context of natural language
translation and one that inherently involves training on noisy inputs.
Unfortunately, source code is highly sensitive to small changes; a single token
can result in compilation failures or erroneous programs, unlike natural
languages where small inaccuracies may not change the meaning of a sentence. To
address this issue, we propose to leverage an automated unit-testing system to
filter out invalid translations, thereby creating a fully tested parallel
corpus. We found that fine-tuning an unsupervised model with this filtered data
set significantly reduces the noise in the translations so-generated,
comfortably outperforming the state-of-the-art for all language pairs studied.
In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the
best previous methods by more than 16% and 24% respectively, reducing the error
rate by more than 35%.
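The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`passes_all_tests`, `build_parallel_corpus`), the test-case format, and the toy candidates are all hypothetical, and the target language is assumed to be Python so that candidates can be executed directly. It only shows the principle that a candidate translation is kept as a parallel pair when it passes every unit test.

```python
from typing import List, Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_all_tests(candidate_src: str, func_name: str,
                     tests: Sequence[TestCase]) -> bool:
    """Return True if the candidate defines func_name and passes every test."""
    namespace: dict = {}
    try:
        # Assumption: candidates are trusted toy snippets. A real system would
        # run them in a sandboxed subprocess with a timeout.
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count
        # as failed translations.
        return False


def build_parallel_corpus(source_src: str, func_name: str,
                          candidates: Sequence[str],
                          tests: Sequence[TestCase]) -> List[Tuple[str, str]]:
    """Pair the source function with each candidate that passes all tests."""
    return [(source_src, cand) for cand in candidates
            if passes_all_tests(cand, func_name, tests)]


if __name__ == "__main__":
    java_source = "static int add(int a, int b) { return a + b; }"
    candidates = [
        "def add(a, b):\n    return a - b\n",  # wrong operator
        "def add(a, b):\n    return a + b\n",  # correct
        "def add(a, b)\n    return a + b\n",   # missing colon: syntax error
    ]
    tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    corpus = build_parallel_corpus(java_source, "add", candidates, tests)
    print(len(corpus))  # 1: only the correct candidate survives
```

In this toy setting a single wrong token (a `-` instead of `+`, or a missing colon) is enough to reject a candidate, which mirrors the abstract's point about the sensitivity of source code to small changes.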
Related papers
- Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping [60.458273797431836]
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models.
We find that this approach does not work well on non-English tasks.
Inspired by previous interpretability work on language transition during the model's forward pass, we propose an improved contrastive decoding algorithm.
arXiv Detail & Related papers (2024-07-15T15:14:01Z)
- A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models [50.86686630756207]
Research shows that grammatical mistakes in a sentence can be corrected by translating it to another language and back.
Current generative models for Automatic Program Repair (APR) are pre-trained on source code and fine-tuned for repair.
This paper proposes bypassing the fine-tuning step and using Round-Trip Translation (RTT): translation of code from one programming language to another programming or natural language, and back.
arXiv Detail & Related papers (2024-01-15T22:36:31Z)
- Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages [0.34998703934432673]
We build a single-decoder neural machine translation system for Dravidian-Dravidian multilingual translation.
Our model achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
arXiv Detail & Related papers (2023-08-10T13:38:09Z)
- Measuring The Impact Of Programming Language Distribution [28.96076723773365]
We present the BabelCode framework for execution-based evaluation of any benchmark in any language.
We present a new code translation dataset called Translating Python Programming Puzzles (TP3).
We investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages.
arXiv Detail & Related papers (2023-02-03T19:47:22Z)
- Code Translation with Compiler Representations [21.702473137941006]
Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
arXiv Detail & Related papers (2022-06-30T14:21:57Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Detecting over/under-translation errors for determining adequacy in human translations [0.0]
We present a novel approach to detecting over and under translations (OT/UT) as part of adequacy error checks in translation evaluation.
We do not restrict ourselves to machine translation (MT) outputs and specifically target applications with a human-generated translation pipeline.
The goal of our system is to identify OT/UT errors from human translated video subtitles with high error recall.
arXiv Detail & Related papers (2021-04-01T06:06:36Z)
- Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)