Boosting Neural Networks to Decompile Optimized Binaries
- URL: http://arxiv.org/abs/2301.00969v1
- Date: Tue, 3 Jan 2023 06:45:54 GMT
- Title: Boosting Neural Networks to Decompile Optimized Binaries
- Authors: Ying Cao, Ruigang Liang, Kai Chen, Peiwei Hu
- Abstract summary: Decompilation aims to transform a low-level programming language (LPL) into its functionally equivalent high-level programming language (HPL).
We propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries.
- Score: 13.255618541522436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decompilation aims to transform a low-level programming language (LPL) (e.g.,
binary file) into its functionally equivalent high-level programming language (HPL)
(e.g., C/C++). It is a core technology in software security, especially in
vulnerability discovery and malware analysis. In recent years, with the
successful application of neural machine translation (NMT) models in natural
language processing (NLP), researchers have tried to build neural decompilers
by borrowing the idea of NMT. They formulate the decompilation process as a
translation problem between LPL and HPL, aiming to reduce the human cost
required to develop decompilation tools and improve their generalizability.
However, state-of-the-art learning-based decompilers do not cope well with
compiler-optimized binaries. Since real-world binaries are mostly
compiler-optimized, decompilers that do not consider optimized binaries have
limited practical significance. In this paper, we propose a novel
learning-based approach named NeurDP that targets compiler-optimized binaries.
NeurDP uses a graph neural network (GNN) model to convert LPL to an
intermediate representation (IR), which bridges the gap between source code and
optimized binary. We also design an Optimized Translation Unit (OTU) to split
functions into smaller code fragments for better translation performance.
Evaluation results on datasets containing various types of statements show that
NeurDP can decompile optimized binaries with 45.21% higher accuracy than
state-of-the-art neural decompilation frameworks.
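Read as a pipeline, the abstract describes a staged flow: lift the optimized binary to low-level code, map that code to an intermediate representation with a graph neural network, split the IR into small fragments via the Optimized Translation Unit, and translate each fragment into high-level code. The skeleton below is a minimal, hedged sketch of that flow in Python; the function names (lift_to_instructions, predict_ir, split_into_units, translate_unit) and their stubbed rule-based behaviour are placeholders for exposition, not NeurDP's actual models or tooling.

```python
# Illustrative skeleton of a NeurDP-style decompilation pipeline (not the
# authors' code): lift the binary, map low-level code to an IR with a learned
# model, split the IR into small translation units, translate each unit.
from dataclasses import dataclass
from typing import List


@dataclass
class IRInstruction:
    opcode: str            # e.g. "mov", "add", "ret"
    operands: List[str]    # registers, memory operands, constants


def lift_to_instructions(binary_function: bytes) -> List[str]:
    """Disassemble a binary function into textual low-level instructions.
    Stub: a real pipeline would call a disassembler here."""
    return ["mov eax, [rbp-4]", "add eax, 1", "ret"]


def predict_ir(instructions: List[str]) -> List[IRInstruction]:
    """Stand-in for the learned LPL-to-IR model (NeurDP trains a GNN for this
    step); this stub simply re-parses each instruction one-to-one."""
    ir = []
    for ins in instructions:
        opcode, _, rest = ins.partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        ir.append(IRInstruction(opcode, operands))
    return ir


def split_into_units(ir: List[IRInstruction], max_len: int = 2) -> List[List[IRInstruction]]:
    """Split a function's IR into small fragments, in the spirit of the Optimized
    Translation Unit (OTU): short, self-contained inputs translate more reliably."""
    return [ir[i:i + max_len] for i in range(0, len(ir), max_len)]


def translate_unit(unit: List[IRInstruction]) -> str:
    """Stand-in for the IR-to-HPL translator; emits a pseudo-C comment per fragment."""
    body = "; ".join(f"{i.opcode} {', '.join(i.operands)}".strip() for i in unit)
    return f"/* {body} */"


def decompile(binary_function: bytes) -> str:
    ir = predict_ir(lift_to_instructions(binary_function))
    return "\n".join(translate_unit(u) for u in split_into_units(ir))


if __name__ == "__main__":
    print(decompile(b"\x8b\x45\xfc\x83\xc0\x01\xc3"))
```

The fragment-level step mirrors the stated motivation for the OTU: short fragments give the neural translator smaller, more tractable inputs than whole optimized functions.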
Related papers
- DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs [56.24431208419858]
We introduce Direct Preference Learning with Only Self-Generated Tests and Code (DSTC).
DSTC uses only self-generated code snippets and tests to construct reliable preference pairs.
arXiv Detail & Related papers (2024-11-20T02:03:16Z)
- SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly [6.080751346188323]
This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code.
We utilize type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches.
arXiv Detail & Related papers (2023-05-21T17:31:39Z)
- Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z)
- Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries [4.0484792045035505]
We extend large pre-trained language models of source code to summarise decompiled binary functions.
We investigate the impact of input and data properties on the performance of such models.
BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively.
arXiv Detail & Related papers (2023-01-04T16:56:33Z)
- Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages.
We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation.
We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
- Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with a binarization function.
We present new designs of binary neural modules, which outperform prior binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
- Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks (a small numeric sketch of this kind of decomposition appears after this list).
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
- PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives [55.79741270235602]
We present compiler algorithms to automatically generate high performance implementations of Deep Learning primitives.
We develop novel data reuse analysis algorithms using the polyhedral model.
We also show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance.
arXiv Detail & Related papers (2020-06-02T06:44:09Z)
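For the {-1, +1} encoding decomposition referenced above, the following is a small numeric sketch of the general idea, assuming uniform odd-integer quantization levels and power-of-two branch scales; the helper name decompose_to_binary_branches and the exact scaling rule are illustrative assumptions, not necessarily the paper's precise scheme.

```python
# Hedged illustration (not the paper's code): odd-integer quantized weights can be
# rewritten as a weighted sum of {-1, +1} tensors, turning one quantized layer into
# several binary branches recombined with fixed scales.
import numpy as np


def decompose_to_binary_branches(q_weights: np.ndarray, bits: int):
    """Decompose odd-integer weights in {-(2**bits - 1), ..., 2**bits - 1} into
    `bits` tensors with entries in {-1, +1} plus power-of-two scales, so that
    sum_i scales[i] * branches[i] reconstructs q_weights exactly."""
    shifted = (q_weights + (2 ** bits - 1)) // 2  # map odd levels to {0, ..., 2**bits - 1}
    branches, scales = [], []
    for i in range(bits):
        bit = (shifted >> i) & 1                  # i-th binary digit of each weight
        branches.append(2 * bit - 1)              # {0, 1} -> {-1, +1}
        scales.append(2 ** i)
    return branches, scales


# 2-bit example: quantization levels {-3, -1, +1, +3}
w = np.array([-3, -1, 1, 3])
branches, scales = decompose_to_binary_branches(w, bits=2)
assert np.array_equal(sum(s * b for s, b in zip(scales, branches)), w)
```

Each branch contains only {-1, +1} values, so it can be evaluated with cheap binary arithmetic, and recombining the branches with fixed scales recovers the original quantized computation; that is the kind of acceleration the paper's title refers to.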
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.