On ML-Based Program Translation: Perils and Promises
- URL: http://arxiv.org/abs/2302.10812v1
- Date: Tue, 21 Feb 2023 16:42:20 GMT
- Title: On ML-Based Program Translation: Perils and Promises
- Authors: Aniketh Malyala and Katelyn Zhou and Baishakhi Ray and Saikat
Chakraborty
- Abstract summary: This work investigates unsupervised program translators and where and why they fail.
We develop a rule-based program mutation engine that pre-processes the input code when it follows specific patterns and post-processes the output when it follows certain patterns.
In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline.
- Score: 17.818482089078028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of new and advanced programming languages, it becomes
imperative to migrate legacy software to new programming languages.
Unsupervised Machine Learning-based Program Translation could play an essential
role in such migration, even without a sufficiently sizeable and reliable corpus of
parallel source code. However, these translators are far from perfect due to
their statistical nature. This work investigates unsupervised program
translators and where and why they fail. Through in-depth error analysis of these
failures, we have identified that they follow a few particular patterns. With this
insight, we develop a rule-based program mutation engine, which pre-processes the
input code if the input follows specific patterns and post-processes the output if
the output follows certain
patterns. We show that our code processing tool, in conjunction with the
program translator, can form a hybrid program translator and significantly
improve the state-of-the-art. In the future, we envision an end-to-end program
translation tool where programming domain knowledge can be embedded into an
ML-based translation pipeline using pre- and post-processing steps.
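As a rough illustration of the pre-/post-processing idea described in the abstract, the sketch below wires a rule-based mutation step around a generic ML translator. It is a minimal sketch, not the paper's implementation: the function names and the two example rules are hypothetical stand-ins for the patterns identified in the error analysis.

```python
# Minimal sketch of a hybrid program translator (illustrative only).
# `ml_translate` stands in for any unsupervised ML translator; the rules
# below are hypothetical examples, not the paper's actual mutation rules.
import re
from typing import Callable

def pre_process(source: str) -> str:
    """Rewrite input patterns the ML translator is assumed to mishandle."""
    # Hypothetical rule: expand compound assignments before translation.
    return re.sub(r"(\w+)\s*\+=\s*(.+);", r"\1 = \1 + \2;", source)

def post_process(translation: str) -> str:
    """Repair recurring error patterns in the ML translator's output."""
    # Hypothetical rule: normalize an incorrectly translated print call.
    return translation.replace("System.out.println", "print")

def hybrid_translate(source: str, ml_translate: Callable[[str], str]) -> str:
    """Rule-based pre-processing -> ML translation -> rule-based post-processing."""
    return post_process(ml_translate(pre_process(source)))
```

A full system would carry many such rules derived from the error-pattern analysis, but the control flow stays the same: mutate the input, translate, then repair the output.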
Related papers
- Exploring and Unleashing the Power of Large Language Models in Automated Code Translation [40.25727029618665]
This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks.
UniTrans is a Unified code Translation framework, applicable to various LLMs.
Three recent LLMs of diverse sizes are tested with UniTrans, and all achieve substantial improvements.
arXiv Detail & Related papers (2024-04-23T00:49:46Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (by 4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all evaluated benchmarks.
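As a rough, self-contained illustration of this verify-by-execution idea (a sketch, not LEVER's actual implementation), the snippet below executes sampled candidate programs and reorders them by the score of a hypothetical learned verifier; the `result` variable convention and the `verifier_score` callback are assumptions for illustration.

```python
# Illustrative sketch of execution-guided reranking in the spirit of LEVER
# (not the paper's code). `verifier_score` is a hypothetical learned verifier
# conditioned on the natural-language input, the program, and its execution result.
from typing import Callable, List, Tuple

def execute(program: str, env: dict) -> object:
    """Run a candidate program and capture its result, or the error it raises."""
    scope = dict(env)
    try:
        exec(program, scope)        # illustration only; never run untrusted code like this
        return scope.get("result")  # assumed convention: programs write their answer to `result`
    except Exception as err:
        return f"ERROR: {err}"

def rerank(nl_input: str,
           candidates: List[str],
           env: dict,
           verifier_score: Callable[[str, str, object], float]) -> List[Tuple[str, float]]:
    """Score each sampled program on (input, program, execution result) and sort best-first."""
    scored = [(p, verifier_score(nl_input, p, execute(p, env))) for p in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```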
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- Syntax and Domain Aware Model for Unsupervised Program Translation [23.217899398362206]
We propose SDA-Trans, a syntax and domain-aware model for program translation.
It leverages the syntax structure and domain knowledge to enhance the cross-lingual transfer ability.
The experimental results on function translation tasks between Python, Java, and C++ show that SDA-Trans outperforms many large-scale pre-trained models.
arXiv Detail & Related papers (2023-02-08T06:54:55Z)
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that models trained on these noisy datasets perform comparably to models trained on ground truth for reasonable levels of noise.
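As a loose sketch of this mining idea (assuming TF-IDF cosine similarity and scikit-learn, which the paper does not prescribe), one might pair functions across two languages as below; the 0.5 threshold and the Java/Python pairing are purely illustrative.

```python
# Minimal sketch of building a noisy parallel corpus via document similarity.
# TF-IDF cosine similarity is an assumption for illustration; the paper studies
# document-similarity methods in general, not this exact recipe.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_pairs(java_funcs, python_funcs, threshold=0.5):
    """Pair each Java function with its most similar Python function above a threshold."""
    vectorizer = TfidfVectorizer(token_pattern=r"\w+")
    matrix = vectorizer.fit_transform(java_funcs + python_funcs)
    sims = cosine_similarity(matrix[:len(java_funcs)], matrix[len(java_funcs):])
    pairs = []
    for i, row in enumerate(sims):
        j = row.argmax()
        if row[j] >= threshold:
            pairs.append((java_funcs[i], python_funcs[j], float(row[j])))
    return pairs
```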
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Tea: Program Repair Using Neural Network Based on Program Information Attention Matrix [14.596847020236657]
We propose a unified representation to capture the syntax, data flow, and control flow aspects of software programs.
We then devise a method that uses this representation to guide a Transformer model from NLP toward better understanding and fixing buggy programs.
arXiv Detail & Related papers (2021-07-17T15:49:22Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Synthetic Datasets for Neural Program Synthesis [66.20924952964117]
We propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications.
We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
arXiv Detail & Related papers (2019-12-27T21:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.