Beyond the C: Retargetable Decompilation using Neural Machine
Translation
- URL: http://arxiv.org/abs/2212.08950v1
- Date: Sat, 17 Dec 2022 20:45:59 GMT
- Title: Beyond the C: Retargetable Decompilation using Neural Machine
Translation
- Authors: Iman Hosseini, Brendan Dolan-Gavitt
- Abstract summary: We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
- Score: 5.734661402742406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of reversing the compilation process, decompilation, is an
important tool in reverse engineering of computer software. Recently,
researchers have proposed using techniques from neural machine translation to
automate the process of decompilation. Although such techniques hold the
promise of targeting a wider range of source and assembly languages, to date
they have primarily targeted C code. In this paper we argue that existing
neural decompilers have achieved higher accuracy at the cost of requiring
language-specific domain knowledge such as tokenizers and parsers to build an
abstract syntax tree (AST) for the source language, which increases the
overhead of supporting new languages. We explore a different tradeoff that, to
the extent possible, treats the assembly and source languages as plain text,
and show that this allows us to build a decompiler that is easily retargetable
to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on
Go, Fortran, OCaml, and C, and examine the impact of parameters such as
tokenization and training data selection on the quality of decompilation,
finding that it achieves comparable decompilation results to prior work in
neural decompilation with significantly less domain knowledge. We will release
our training data, trained decompilation models, and code to help encourage
future research into language-agnostic decompilation.
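As a rough illustration of what treating assembly and source "as plain text" can look like, the sketch below tokenizes both sides with one language-agnostic rule and no parser or per-language keyword list. It is not BTC's tokenizer: the regular expression, the `<nl>` newline marker, and the names `tokenize` and `make_pair` are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical language-agnostic rule (an assumption, not taken from the paper):
# a token is a run of word characters, a newline, or one punctuation character.
_TOKEN = re.compile(r"\w+|\n|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split any program text into tokens without language-specific knowledge."""
    # Newlines are kept as an explicit marker; other whitespace is dropped.
    return ["<nl>" if tok == "\n" else tok for tok in _TOKEN.findall(text)]


def make_pair(asm: str, src: str) -> Tuple[List[str], List[str]]:
    """Build one (assembly, source) training pair for a translation model."""
    return tokenize(asm), tokenize(src)
```

Because the same rule applies to every language, adding a new source language in this sketch needs only new (assembly, source) text pairs, for example `make_pair("ret", "return 0;")`.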
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- Leveraging Large Language Models for Code Translation and Software Development in Scientific Computing [0.9668407688201359]
Generative artificial intelligence (GenAI) is poised to transform productivity in scientific computing.
We developed a tool, CodeScribe, which combines prompt engineering with user supervision to establish an efficient process for code conversion.
We also address the challenges of AI-driven code translation and highlight its benefits for enhancing productivity in scientific computing.
arXiv Detail & Related papers (2024-10-31T16:48:41Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Boosting Neural Networks to Decompile Optimized Binaries [13.255618541522436]
Decompilation aims to transform a low-level program language (LPL) into its functionally-equivalent high-level program language (HPL).
We propose a novel learning-based approach named NeurDP, that targets compiler-optimized binaries.
arXiv Detail & Related papers (2023-01-03T06:45:54Z)
- Learning C to x86 Translation: An Experiment in Neural Compilation [3.997680012976965]
Code-to-code neural models have been used in code translation, code refinement and decompilation.
In this work, we explore neural compilation, building and evaluating Transformer models that learn how to produce x86 assembler from C code.
arXiv Detail & Related papers (2021-08-17T14:11:15Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- Unsupervised Translation of Programming Languages [19.56070393390029]
A transcompiler, also known as a source-to-source translator, is a system that converts source code from one high-level programming language to another.
We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy.
arXiv Detail & Related papers (2020-06-05T15:28:01Z)
- SCELMo: Source Code Embeddings from Language Models [33.673421734844474]
We introduce a new set of deep contextualized word representations for computer programs based on language models.
We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.
arXiv Detail & Related papers (2020-04-28T00:06:25Z)
- Bi-Decoder Augmented Network for Neural Machine Translation [108.3931242633331]
We propose a novel Bi-Decoder Augmented Network (BiDAN) for the neural machine translation task.
Since each decoder transforms the representations of the input text into its corresponding language, jointly training with two target ends gives the shared encoder the potential to produce a language-independent semantic space.
arXiv Detail & Related papers (2020-01-14T02:05:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.