Code Translation with Compiler Representations
- URL: http://arxiv.org/abs/2207.03578v5
- Date: Mon, 24 Apr 2023 10:12:18 GMT
- Title: Code Translation with Compiler Representations
- Authors: Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton,
Patrick Labatut, Gabriel Synnaeve
- Abstract summary: Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
- Score: 21.702473137941006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we leverage low-level compiler intermediate representations
(IR) to improve code translation. Traditional transpilers rely on syntactic
information and handcrafted rules, which limits their applicability and
produces unnatural-looking code. Applying neural machine translation (NMT)
approaches to code has successfully broadened the set of programs on which one
can get a natural-looking translation. However, they treat the code as
sequences of text tokens, and still do not differentiate well enough between
similar pieces of code which have different semantics in different languages.
The consequence is low-quality translation, reducing the practicality of NMT
and stressing the need for an approach that significantly increases its accuracy.
Here we propose to augment code translation with IRs, specifically LLVM IR,
with results on the C++, Java, Rust, and Go languages. Our method improves upon
the state of the art for unsupervised code translation, increasing the number
of correct translations by 11% on average, and up to 79% for the Java -> Rust
pair with greedy decoding. We extend previous test sets for code translation,
by adding hundreds of Go and Rust functions. Additionally, we train models with
high performance on the problem of IR decompilation, generating programming
source code from IR, and study the use of IRs as an intermediary pivot for translation.
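To make the idea concrete, the snippet below is a minimal sketch of how (source, IR) pairs of the kind this approach relies on can be produced: it compiles a C++ file to textual LLVM IR with clang and bundles both views of the same function into one record. The file name, helper names, and record format are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch (not the authors' pipeline): pair a C++ source file
# with its LLVM IR by invoking clang, producing the kind of (source, IR)
# training example that IR-augmented translation builds on.
import subprocess
from pathlib import Path

def source_to_llvm_ir(cpp_path: str) -> str:
    """Compile a C++ file to textual LLVM IR with optimizations off,
    so the IR stays close to the original source structure."""
    ir_path = Path(cpp_path).with_suffix(".ll")
    subprocess.run(
        ["clang++", "-S", "-emit-llvm", "-O0", cpp_path, "-o", str(ir_path)],
        check=True,
    )
    return ir_path.read_text()

def make_training_example(cpp_path: str) -> dict:
    """Bundle source code and its IR into one record; a translation model
    can then be trained on both views of the same semantics."""
    return {
        "source_lang": "cpp",
        "source_code": Path(cpp_path).read_text(),
        "llvm_ir": source_to_llvm_ir(cpp_path),
    }

if __name__ == "__main__":
    example = make_training_example("add.cpp")  # hypothetical input file
    print(example["llvm_ir"][:400])
```

Compiling at -O0 keeps the IR close to the shape of the source; the paper's actual preprocessing and data format may differ.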
Related papers
- CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming [15.391781573025787]
We introduce CodeRosetta, an encoder-decoder model designed specifically for translating between programming languages and their HPC extensions.
CodeRosetta is evaluated on C++ to parallel C++ translation tasks.
Our results show that CodeRosetta outperforms state-of-the-art baselines on these translation tasks.
arXiv Detail & Related papers (2024-10-27T17:34:07Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
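As an illustration of this continued-pretraining recipe, the sketch below runs one standard causal language-modelling step on a sequence that concatenates a source function with its LLVM IR, so the model must predict the IR tokens as well. The checkpoint name, separator tags, and hyperparameters are assumptions for illustration, not details from the IRCoder paper.

```python
# Illustrative sketch (checkpoint name, tags, and hyperparameters are assumptions,
# not from the IRCoder paper): continued causal LM training on (source, IR) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-code-lm"  # hypothetical base Code-LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def ir_grounding_step(source_code: str, llvm_ir: str) -> float:
    """One continued-pretraining step on a concatenated (source, IR) sequence."""
    text = f"<source>\n{source_code}\n<ir>\n{llvm_ir}"  # illustrative separators
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    # Standard causal LM objective: labels are the input ids, shifted internally.
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```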
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Program Translation via Code Distillation [20.668229308907495]
Traditional machine translation relies on parallel corpora for supervised translation.
Recent unsupervised neural machine translation techniques have overcome data limitations.
We propose a novel model called Code Distillation (CoDist).
arXiv Detail & Related papers (2023-10-17T04:59:15Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z) - Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially
Code-Switched Data [26.38449396649045]
We show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages.
Motivated by this, we propose to train ranking models on artificially code-switched data instead.
arXiv Detail & Related papers (2023-05-09T09:32:19Z) - Leveraging Automated Unit Tests for Unsupervised Code Translation [34.84910520660154]
We propose to leverage an automated unit-testing system to filter out invalid translations.
We find that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the generated translations.
In particular, for Java -> Python and Python -> C++, we outperform the best previous methods by more than 16% and 24% respectively.
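A minimal sketch of this filtering idea, assuming Python as the target language and simple input/output test cases; the function and helper names below are hypothetical, not the paper's actual system:

```python
# Illustrative sketch (not the paper's actual system): keep only those
# model-generated Python translations that pass the unit tests, and use the
# survivors as fine-tuning data.
from typing import Callable

def passes_unit_tests(candidate_src: str, func_name: str,
                      test_cases: list[tuple[tuple, object]]) -> bool:
    """Execute a candidate translation and check it on (inputs, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # run the generated code
        func: Callable = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                            # crashes count as failures

def filter_translations(candidates: list[str], func_name: str,
                        test_cases: list[tuple[tuple, object]]) -> list[str]:
    """Return the subset of candidate translations suitable for fine-tuning."""
    return [c for c in candidates if passes_unit_tests(c, func_name, test_cases)]

# Example: two candidate translations of a "max of two ints" function.
candidates = [
    "def max2(a, b):\n    return a if a > b else b",   # correct
    "def max2(a, b):\n    return a",                   # wrong, filtered out
]
tests = [((1, 2), 2), ((5, 3), 5), ((-1, -4), -1)]
print(filter_translations(candidates, "max2", tests))  # keeps only the first
```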
arXiv Detail & Related papers (2021-10-13T15:08:43Z) - Pre-training Multilingual Neural Machine Translation by Leveraging
Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across a diverse setting, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z) - Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z) - Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)