Syntax and Domain Aware Model for Unsupervised Program Translation
- URL: http://arxiv.org/abs/2302.03908v1
- Date: Wed, 8 Feb 2023 06:54:55 GMT
- Title: Syntax and Domain Aware Model for Unsupervised Program Translation
- Authors: Fang Liu, Jia Li, Li Zhang
- Abstract summary: We propose SDA-Trans, a syntax and domain-aware model for program translation.
It leverages the syntax structure and domain knowledge to enhance the cross-lingual transfer ability.
The experimental results on function translation tasks between Python, Java, and C++ show that SDA-Trans outperforms many large-scale pre-trained models.
- Score: 23.217899398362206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is growing interest in software migration as software and
society develop. Manually migrating projects between languages is
error-prone and expensive. In recent years, researchers have begun to explore
automatic program translation using supervised deep learning techniques that
learn from large-scale parallel code corpora. However, parallel resources are
scarce in the programming language domain, and it is costly to collect
bilingual data manually. To address this issue, several unsupervised
program translation systems have been proposed. However, these systems still
rely on huge amounts of monolingual source code for training, which is very
expensive. Moreover, these models perform poorly when translating languages
that were not seen during pre-training. In this paper, we propose SDA-Trans, a
syntax and domain-aware model for program translation, which leverages the
syntax structure and domain knowledge to enhance the cross-lingual transfer
ability. SDA-Trans adopts unsupervised training on a smaller-scale corpus,
including Python and Java monolingual programs. The experimental results on
function translation tasks between Python, Java, and C++ show that SDA-Trans
outperforms many large-scale pre-trained models, especially for unseen language
translation.
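As a rough illustration of the syntax information a syntax-aware translator can consume, the sketch below linearizes a Python function's AST with the standard ast module; the depth-first token format is an assumption for illustration, not SDA-Trans's actual encoding.
```python
# Minimal sketch: linearize a Python function's AST into a token sequence,
# the kind of syntax structure a syntax-aware model can be trained on.
# The depth-first format is an assumption, not SDA-Trans's actual encoding.
import ast

def linearize_ast(source: str) -> list[str]:
    """Emit AST node type names in depth-first order."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

print(linearize_ast("def add(a, b):\n    return a + b"))
# ['Module', 'FunctionDef', 'arguments', 'arg', 'arg', 'Return', 'BinOp',
#  'Name', 'Load', 'Add', 'Name', 'Load']
```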
Related papers
- Exploring and Unleashing the Power of Large Language Models in Automated Code Translation [40.25727029618665]
This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks.
UniTrans is a Unified code Translation framework, applicable to various LLMs.
Three recent LLMs of diverse sizes are tested with UniTrans, and all achieve substantial improvements.
arXiv Detail & Related papers (2024-04-23T00:49:46Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
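For context, source/IR pairs of the kind SLTrans contains can be produced with stock compiler tooling; the sketch below emits textual LLVM IR for a C file with clang. This is a plausible ingredient, not the paper's exact pipeline.
```python
# Sketch: emit textual LLVM IR for a C source file using clang, the kind of
# source/IR pairing SLTrans is built from (the paper's exact pipeline may
# differ). Assumes clang is installed and on PATH.
import subprocess

def c_to_llvm_ir(c_path: str) -> str:
    result = subprocess.run(
        ["clang", "-S", "-emit-llvm", "-O1", c_path, "-o", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout  # human-readable .ll output

# print(c_to_llvm_ir("hello.c"))
```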
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process that mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z) - Zero-shot Cross-lingual Transfer without Parallel Corpus [6.937772043639308]
We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model.
It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment.
A self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training.
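A generic self-training step of this flavor can be sketched as follows, with pseudo hard labels as argmax predictions and pseudo soft labels as the full probability vectors; this is a simplified stand-in, not the paper's exact module.
```python
# Generic self-training sketch: derive pseudo hard labels (argmax) and pseudo
# soft labels (probability vectors) from model predictions on unlabeled data.
# Simplified stand-in for the idea above, not the paper's exact module.
import numpy as np

def pseudo_labels(probs: np.ndarray, threshold: float = 0.9):
    """probs: (n_samples, n_classes) predicted probabilities."""
    hard = probs.argmax(axis=1)             # pseudo hard labels
    keep = probs.max(axis=1) >= threshold   # keep confident samples only
    return hard[keep], probs[keep]          # hard labels, soft labels

probs = np.array([[0.95, 0.05], [0.55, 0.45]])
hard, soft = pseudo_labels(probs)
print(hard)  # [0]  -- only the confident sample survives
```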
arXiv Detail & Related papers (2023-10-07T07:54:22Z) - On ML-Based Program Translation: Perils and Promises [17.818482089078028]
This work investigates unsupervised program translators and where and why they fail.
We develop a rule-based program mutation engine, which pre-processes the input code if it follows specific patterns and post-processes the output if it follows certain patterns.
In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline.
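The overall shape of such a wrapper can be sketched as below; the two rewrite rules are invented for illustration and are not the rules from the paper.
```python
# Sketch of a rule-based pre/post-processing wrapper around a neural
# translator, in the spirit described above. Both rewrite rules are invented
# for illustration; the paper's actual mutation rules differ.
import re

def preprocess(python_src: str) -> str:
    # Illustrative rule: make floor division explicit so a Python->C++
    # translator cannot mistake `//` for a comment marker.
    return re.sub(r"(\w+)\s*//\s*(\w+)", r"math.floor(\1 / \2)", python_src)

def postprocess(cpp_src: str) -> str:
    # Illustrative rule: collapse accidental repeated semicolons.
    return re.sub(r";;+", ";", cpp_src)

def translate(python_src: str, model) -> str:
    # `model` is any object exposing a translate(str) -> str method.
    return postprocess(model.translate(preprocess(python_src)))
```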
arXiv Detail & Related papers (2023-02-21T16:42:20Z) - Summarize and Generate to Back-translate: Unsupervised Translation of
Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
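Schematically, the pivot works as below; `summarize` and `generate` stand for trained models, and the function names and signatures are placeholders rather than the paper's API.
```python
# Schematic sketch of back-translation through a natural-language pivot.
# `summarize` and `generate` stand for trained summarization and generation
# models; the names and signatures are placeholders, not the paper's API.
def make_synthetic_pair(java_fn: str, summarize, generate):
    summary = summarize(java_fn)                  # code -> natural language
    python_fn = generate(summary, lang="python")  # natural language -> code
    # (python_fn, java_fn) becomes a synthetic parallel pair on which the
    # Python->Java direction can be trained, alternating directions over time.
    return python_fn, java_fn
```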
arXiv Detail & Related papers (2022-05-23T08:20:41Z) - AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z) - Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate "TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT.
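The publicly released checkpoints from this line of work support direct non-English translation; a minimal sketch, assuming the facebook/m2m100_418M checkpoint and the transformers library are available:
```python
# Minimal sketch: direct French -> German translation, assuming the publicly
# released facebook/m2m100_418M checkpoint and the `transformers` library.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("de")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```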
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)