Program Translation via Code Distillation
- URL: http://arxiv.org/abs/2310.11476v1
- Date: Tue, 17 Oct 2023 04:59:15 GMT
- Title: Program Translation via Code Distillation
- Authors: Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin
Clement, Neel Sundaresan
- Abstract summary: Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
- Score: 20.668229308907495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software version migration and program translation are an important and
costly part of the lifecycle of large codebases. Traditional machine
translation relies on parallel corpora for supervised translation, which is not
feasible for program translation due to a dearth of aligned data. Recent
unsupervised neural machine translation techniques have overcome data
limitations by incorporating techniques such as back translation and low-level
compiler intermediate representations (IRs). These methods face significant
challenges due to the noise in code snippet alignment and the diversity of IRs,
respectively. In this paper, we propose a novel model called Code Distillation
(CoDist) whereby we capture the semantic and structural equivalence of code in
a language agnostic intermediate representation. Distilled code serves as a
translation pivot for any programming language, leading by construction to
parallel corpora which scale to all available source code by simply applying
the distillation compiler. We demonstrate that our approach achieves
state-of-the-art performance on the CodeXGLUE and TransCoder GeeksforGeeks
translation benchmarks, with an average absolute increase of 12.7% on the
TransCoder GeeksforGeeks translation benchmark compared to TransCoder-ST.
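The distillation compiler itself is not reproduced here; the following is a minimal, hypothetical sketch of the pivot idea: if a distillation function maps each language's surface syntax to one canonical form, then independently distilled corpora become parallel by construction. The `distill` rules below are illustrative stand-ins, not the paper's compiler.

```python
import re

def distill(code: str) -> str:
    """Toy 'distillation': strip language-specific surface syntax and
    canonicalize identifiers so equivalent snippets share one pivot form.
    (Illustrative stand-in for the paper's distillation compiler.)"""
    code = re.sub(r"\b(?:int|long|double|var|let|const)\b", "", code)  # drop type keywords
    code = code.replace(";", "")                                       # drop statement terminators
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    canon, names = [], {}
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            names.setdefault(tok, f"v{len(names)}")  # canonical identifier names
            canon.append(names[tok])
        else:
            canon.append(tok)
    return " ".join(canon)

java_snippet = "int total = count + 1;"
python_snippet = "total = count + 1"
assert distill(java_snippet) == distill(python_snippet)  # same pivot => parallel pair
print(distill(java_snippet))  # v0 = v1 + 1
```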
Related papers
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy.
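One simple way to realize the multiple-references idea is to sample several beam candidates from a translation model and keep the distinct outputs. The sketch below assumes a CodeT5-style seq2seq checkpoint; the base checkpoint named here is a placeholder and is not itself fine-tuned for translation.

```python
# Sketch: generate extra translation references as distinct beam candidates.
# The checkpoint is a placeholder -- a translation-fine-tuned model is assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def extra_references(java_src: str, k: int = 4) -> list[str]:
    inputs = tok(java_src, return_tensors="pt")
    outs = model.generate(**inputs, num_beams=2 * k, num_return_sequences=k,
                          max_new_tokens=256)
    cands = [tok.decode(o, skip_special_tokens=True) for o in outs]
    return sorted(set(cands))  # distinct candidates as additional references
```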
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
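As a toy illustration of the guess-then-sketch split, the snippet below turns low-confidence tokens from a hypothetical LM guess into holes that a symbolic solver would be asked to fill; no real LM or solver is invoked.

```python
# Toy illustration: tokens below a confidence threshold become holes ("??")
# for a downstream symbolic solver. The token/probability values are invented.
def sketch(tokens, probs, threshold=0.9):
    return [t if p >= threshold else "??" for t, p in zip(tokens, probs)]

tokens = ["mov", "eax", ",", "ebx"]
probs  = [0.99, 0.97, 0.99, 0.42]       # low confidence on the source register
print(" ".join(sketch(tokens, probs)))  # mov eax , ??  -> hole for the solver
```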
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered a bilingual model in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- On ML-Based Program Translation: Perils and Promises [17.818482089078028]
This work investigates unsupervised program translators and where and why they fail.
We develop a rule-based program mutation engine, which pre-processes the input code if the input follows specific patterns and post-processes the output if the output follows certain patterns.
In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline.
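A minimal sketch of this wrapping pattern, with invented rewrite rules and a placeholder `ml_translate` callable standing in for the learned translator:

```python
import re

def preprocess(java_src: str) -> str:
    # Invented example rule: rewrite a pattern the model might handle
    # poorly into an equivalent form before translation.
    return re.sub(r"(\w+)\s*\+=\s*1\b", r"\1 = \1 + 1", java_src)

def postprocess(py_src: str) -> str:
    # Invented example rule: repair a systematic error in the output.
    return py_src.replace("System.out.println", "print")

def translate(java_src: str, ml_translate) -> str:
    """Wrap a learned translator (placeholder `ml_translate`) with
    rule-based pre- and post-processing, in the spirit of the paper's
    mutation engine."""
    return postprocess(ml_translate(preprocess(java_src)))
```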
arXiv Detail & Related papers (2023-02-21T16:42:20Z)
- Code Translation with Compiler Representations [21.702473137941006]
Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
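A minimal sketch of how such source-IR pairs can be produced, assuming clang is installed; the flags are standard clang options rather than the paper's exact pipeline, and `example.cpp` is a placeholder path.

```python
# Pair a C++ source file with its textual LLVM IR, the kind of augmentation
# the paper studies (assumes clang++ is on PATH; example.cpp is a placeholder).
import subprocess

def llvm_ir(cpp_path: str) -> str:
    out = subprocess.run(
        ["clang++", "-S", "-emit-llvm", "-O0", "-o", "-", cpp_path],
        capture_output=True, text=True, check=True)
    return out.stdout  # textual LLVM IR to feed alongside the source

source = open("example.cpp").read()
training_example = {"source": source, "ir": llvm_ir("example.cpp")}
```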
arXiv Detail & Related papers (2022-06-30T14:21:57Z)
- Principled Paraphrase Generation with Parallel Corpora [52.78059089341062]
We formalize the implicit similarity function induced by round-trip Machine Translation.
We show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation.
We design an alternative similarity metric that mitigates this issue.
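A toy worked example of the failure mode, using an invented two-entry lexicon: "bench" and "savings bank" are not paraphrases, yet both translate to the German word "Bank", so a round-trip similarity check conflates them.

```python
# Toy illustration: round-trip similarity scores two non-paraphrases as
# similar because they share one ambiguous pivot translation.
# (The tiny "lexicon" below is invented for the example.)
to_pivot = {"bench": {"Bank"}, "savings bank": {"Bank"}}  # German "Bank" is ambiguous

def round_trip_similar(a: str, b: str) -> bool:
    # a and b look like paraphrases iff they share any pivot translation
    return bool(to_pivot[a] & to_pivot[b])

print(round_trip_similar("bench", "savings bank"))  # True, though not paraphrases
```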
arXiv Detail & Related papers (2022-05-24T17:22:42Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
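A simple instance of the idea, pairing each Java file with its most similar Python file by TF-IDF cosine similarity; the paper evaluates several document-similarity methods, and this is just one choice, on invented toy data.

```python
# Build a noisy parallel dataset by nearest-neighbor matching of code files
# across languages with TF-IDF cosine similarity (toy corpus for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

java_files = ["int add(int a, int b) { return a + b; }",
              "boolean isEmpty(String s) { return s.length() == 0; }"]
python_files = ["def is_empty(s): return len(s) == 0",
                "def add(a, b): return a + b"]

vec = TfidfVectorizer(token_pattern=r"\w+")
tfidf = vec.fit_transform(java_files + python_files)
sims = cosine_similarity(tfidf[: len(java_files)], tfidf[len(java_files):])

pairs = [(java_files[i], python_files[sims[i].argmax()])
         for i in range(len(java_files))]
for java_src, py_src in pairs:
    print(java_src, "<->", py_src)
```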
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
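For concreteness, here is a sketch of the metric family under study: score a system translation directly against its source via cosine similarity of mean-pooled M-BERT embeddings. The paper's exact layer and pooling choices may differ.

```python
# Reference-free scoring: cosine similarity of mean-pooled M-BERT embeddings
# of source and translation (a generic recipe, not the paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pool over tokens

def ref_free_score(source: str, translation: str) -> float:
    return torch.cosine_similarity(embed(source), embed(translation), dim=0).item()

print(ref_free_score("Das Haus ist groß.", "The house is big."))
```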
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
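A minimal PyTorch skeleton of that layout, with invented sizes and GRU cells standing in for the paper's actual architecture: one shared encoder feeds a source-language decoder and a target-language decoder that are trained jointly.

```python
import torch.nn as nn

class BiDecoderNet(nn.Module):
    """Skeleton of a shared encoder with two decoders: one decodes back to
    the source language, one to the target language. Sizes and GRU cells
    are invented for illustration, not the paper's configuration."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)   # shared encoder
        self.dec_src = nn.GRU(d, d, batch_first=True)   # source-language decoder
        self.dec_tgt = nn.GRU(d, d, batch_first=True)   # target-language decoder
        self.out_src = nn.Linear(d, src_vocab)
        self.out_tgt = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))    # shared semantic state
        src_logits = self.out_src(self.dec_src(self.src_embed(src_ids), h)[0])
        tgt_logits = self.out_tgt(self.dec_tgt(self.tgt_embed(tgt_ids), h)[0])
        return src_logits, tgt_logits                   # both ends trained jointly
```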
arXiv Detail & Related papers (2020-01-14T02:05:14Z)