Program Translation via Code Distillation
- URL: http://arxiv.org/abs/2310.11476v1
- Date: Tue, 17 Oct 2023 04:59:15 GMT
- Title: Program Translation via Code Distillation
- Authors: Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin
Clement, Neel Sundaresan
- Abstract summary: Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
- Score: 20.668229308907495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software version migration and program translation are an important and
costly part of the lifecycle of large codebases. Traditional machine
translation relies on parallel corpora for supervised translation, which is not
feasible for program translation due to a dearth of aligned data. Recent
unsupervised neural machine translation techniques have overcome data
limitations by incorporating techniques such as back translation and low-level
compiler intermediate representations (IRs). These methods face significant
challenges due to the noise in code snippet alignment and the diversity of IRs,
respectively. In this paper, we propose a novel model called Code Distillation
(CoDist) whereby we capture the semantic and structural equivalence of code in
a language agnostic intermediate representation. Distilled code serves as a
translation pivot for any programming language, leading by construction to
parallel corpora which scale to all available source code by simply applying
the distillation compiler. We demonstrate that our approach achieves
state-of-the-art performance on the CodeXGLUE and TransCoder GeeksforGeeks
translation benchmarks, with an average absolute increase of 12.7% on the
TransCoder GeeksforGeeks translation benchmark compared to TransCoder-ST.
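The distillation compiler itself is not reproduced here; the following is a minimal, hypothetical sketch of the pivot idea: if a distillation function maps each language's surface syntax to one canonical form, then independently distilled corpora become parallel by construction. The `distill` rules below are illustrative stand-ins, not the paper's compiler.

```python
import re

def distill(code: str) -> str:
    """Toy 'distillation': strip language-specific surface syntax and
    canonicalize identifiers so equivalent snippets share one pivot form.
    (Illustrative stand-in for the paper's distillation compiler.)"""
    code = re.sub(r"\b(?:int|long|double|var|let|const)\b", "", code)  # drop type keywords
    code = code.replace(";", "")                                       # drop statement terminators
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    canon, names = [], {}
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            names.setdefault(tok, f"v{len(names)}")  # canonical identifier names
            canon.append(names[tok])
        else:
            canon.append(tok)
    return " ".join(canon)

java_snippet = "int total = count + 1;"
python_snippet = "total = count + 1"
assert distill(java_snippet) == distill(python_snippet)  # same pivot => parallel pair
print(distill(java_snippet))  # v0 = v1 + 1
```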
Related papers
- Data Augmentation for Code Translation with Comparable Corpora and Multiple References [21.754147577489764]
We build and analyze multiple types of comparable corpora, including programs generated from natural language documentation.
To reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data.
Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% in Computational Accuracy.
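One simple way to realize the multiple-references idea is to sample several beam candidates from a translation model and keep the distinct outputs. The sketch below assumes a CodeT5-style seq2seq checkpoint; the base checkpoint named here is a placeholder and is not itself fine-tuned for translation.

```python
# Sketch: generate extra translation references as distinct beam candidates.
# The checkpoint is a placeholder -- a translation-fine-tuned model is assumed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def extra_references(java_src: str, k: int = 4) -> list[str]:
    inputs = tok(java_src, return_tensors="pt")
    outs = model.generate(**inputs, num_beams=2 * k, num_return_sequences=k,
                          max_new_tokens=256)
    cands = [tok.decode(o, skip_special_tokens=True) for o in outs]
    return sorted(set(cands))  # distinct candidates as additional references
```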
arXiv Detail & Related papers (2023-11-01T06:01:22Z)
- CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation [8.979765541978292]
CodeTransOcean is a large-scale comprehensive dataset that supports the largest variety of programming languages for code translation.
CodeTransOcean consists of three novel multilingual datasets: MultilingualTrans, supporting translation between multiple popular programming languages; NicheTrans, for translating between niche programming languages and popular ones; and LLMTrans, for evaluating the executability of code translated by large language models (LLMs).
arXiv Detail & Related papers (2023-10-08T00:16:18Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
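As a toy illustration of the guess-then-sketch split, the snippet below turns low-confidence tokens from a hypothetical LM guess into holes that a symbolic solver would be asked to fill; no real LM or solver is invoked.

```python
# Toy illustration: tokens below a confidence threshold become holes ("??")
# for a downstream symbolic solver. The token/probability values are invented.
def sketch(tokens, probs, threshold=0.9):
    return [t if p >= threshold else "??" for t, p in zip(tokens, probs)]

tokens = ["mov", "eax", ",", "ebx"]
probs  = [0.99, 0.97, 0.99, 0.42]       # low confidence on the source register
print(" ".join(sketch(tokens, probs)))  # mov eax , ??  -> hole for the solver
```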
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered a bilingual model in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- On ML-Based Program Translation: Perils and Promises [17.818482089078028]
This work investigates unsupervised program translators and where and why they fail.
We develop a rule-based program mutation engine, which pre-processes the input code if the input follows specific patterns and post-processes the output if the output follows certain patterns.
In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline.
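A minimal sketch of this wrapping pattern, with invented rewrite rules and a placeholder `ml_translate` callable standing in for the learned translator:

```python
import re

def preprocess(java_src: str) -> str:
    # Invented example rule: rewrite a pattern the model might handle
    # poorly into an equivalent form before translation.
    return re.sub(r"(\w+)\s*\+=\s*1\b", r"\1 = \1 + 1", java_src)

def postprocess(py_src: str) -> str:
    # Invented example rule: repair a systematic error in the output.
    return py_src.replace("System.out.println", "print")

def translate(java_src: str, ml_translate) -> str:
    """Wrap a learned translator (placeholder `ml_translate`) with
    rule-based pre- and post-processing, in the spirit of the paper's
    mutation engine."""
    return postprocess(ml_translate(preprocess(java_src)))
```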
arXiv Detail & Related papers (2023-02-21T16:42:20Z)
- Code Translation with Compiler Representations [21.702473137941006]
Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
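A minimal sketch of how such source-IR pairs can be produced, assuming clang is installed; the flags are standard clang options rather than the paper's exact pipeline, and `example.cpp` is a placeholder path.

```python
# Pair a C++ source file with its textual LLVM IR, the kind of augmentation
# the paper studies (assumes clang++ is on PATH; example.cpp is a placeholder).
import subprocess

def llvm_ir(cpp_path: str) -> str:
    out = subprocess.run(
        ["clang++", "-S", "-emit-llvm", "-O0", "-o", "-", cpp_path],
        capture_output=True, text=True, check=True)
    return out.stdout  # textual LLVM IR to feed alongside the source

source = open("example.cpp").read()
training_example = {"source": source, "ir": llvm_ir("example.cpp")}
```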
arXiv Detail & Related papers (2022-06-30T14:21:57Z)
- Principled Paraphrase Generation with Parallel Corpora [52.78059089341062]
We formalize the implicit similarity function induced by round-trip Machine Translation.
We show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation.
We design an alternative similarity metric that mitigates this issue.
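A toy worked example of the failure mode, using an invented two-entry lexicon: "bench" and "savings bank" are not paraphrases, yet both translate to the German word "Bank", so a round-trip similarity check conflates them.

```python
# Toy illustration: round-trip similarity scores two non-paraphrases as
# similar because they share one ambiguous pivot translation.
# (The tiny "lexicon" below is invented for the example.)
to_pivot = {"bench": {"Bank"}, "savings bank": {"Bank"}}  # German "Bank" is ambiguous

def round_trip_similar(a: str, b: str) -> bool:
    # a and b look like paraphrases iff they share any pivot translation
    return bool(to_pivot[a] & to_pivot[b])

print(round_trip_similar("bench", "savings bank"))  # True, though not paraphrases
```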
arXiv Detail & Related papers (2022-05-24T17:22:42Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
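A simple instance of the idea, pairing each Java file with its most similar Python file by TF-IDF cosine similarity; the paper evaluates several document-similarity methods, and this is just one choice, on invented toy data.

```python
# Build a noisy parallel dataset by nearest-neighbor matching of code files
# across languages with TF-IDF cosine similarity (toy corpus for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

java_files = ["int add(int a, int b) { return a + b; }",
              "boolean isEmpty(String s) { return s.length() == 0; }"]
python_files = ["def is_empty(s): return len(s) == 0",
                "def add(a, b): return a + b"]

vec = TfidfVectorizer(token_pattern=r"\w+")
tfidf = vec.fit_transform(java_files + python_files)
sims = cosine_similarity(tfidf[: len(java_files)], tfidf[len(java_files):])

pairs = [(java_files[i], python_files[sims[i].argmax()])
         for i in range(len(java_files))]
for java_src, py_src in pairs:
    print(java_src, "<->", py_src)
```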
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
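For concreteness, here is a sketch of the metric family under study: score a system translation directly against its source via cosine similarity of mean-pooled M-BERT embeddings. The paper's exact layer and pooling choices may differ.

```python
# Reference-free scoring: cosine similarity of mean-pooled M-BERT embeddings
# of source and translation (a generic recipe, not the paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pool over tokens

def ref_free_score(source: str, translation: str) -> float:
    return torch.cosine_similarity(embed(source), embed(translation), dim=0).item()

print(ref_free_score("Das Haus ist groß.", "The house is big."))
```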
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
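A minimal PyTorch skeleton of that layout, with invented sizes and GRU cells standing in for the paper's actual architecture: one shared encoder feeds a source-language decoder and a target-language decoder that are trained jointly.

```python
import torch.nn as nn

class BiDecoderNet(nn.Module):
    """Skeleton of a shared encoder with two decoders: one decodes back to
    the source language, one to the target language. Sizes and GRU cells
    are invented for illustration, not the paper's configuration."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)   # shared encoder
        self.dec_src = nn.GRU(d, d, batch_first=True)   # source-language decoder
        self.dec_tgt = nn.GRU(d, d, batch_first=True)   # target-language decoder
        self.out_src = nn.Linear(d, src_vocab)
        self.out_tgt = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))    # shared semantic state
        src_logits = self.out_src(self.dec_src(self.src_embed(src_ids), h)[0])
        tgt_logits = self.out_tgt(self.dec_tgt(self.tgt_embed(tgt_ids), h)[0])
        return src_logits, tgt_logits                   # both ends trained jointly
```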
arXiv Detail & Related papers (2020-01-14T02:05:14Z)