Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery
- URL: http://arxiv.org/abs/2404.19025v1
- Date: Mon, 29 Apr 2024 18:09:28 GMT
- Title: Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery
- Authors: Iftakhar Ahmad, Lannan Luo,
- Abstract summary: Cross-architecture binary code analysis has become an emerging problem.
Deep learning-based binary analysis has shown promising success.
For some low-resource ISAs, an adequate amount of data is hard to find.
- Score: 2.022692275087205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Binary code analysis has immense importance in the research domain of software security. Today, software is very often compiled for various Instruction Set Architectures (ISAs). As a result, cross-architecture binary code analysis has become an emerging problem. Recently, deep learning-based binary analysis has shown promising success. It is widely known that training a deep learning model requires a massive amount of data. However, for some low-resource ISAs, an adequate amount of data is hard to find, preventing deep learning from being widely adopted for binary analysis. To overcome the data scarcity problem and facilitate cross-architecture binary code analysis, we propose to apply the ideas and techniques in Neural Machine Translation (NMT) to binary code analysis. Our insight is that a binary, after disassembly, is represented in some assembly language. Given a binary in a low-resource ISA, we translate it to a binary in a high-resource ISA (e.g., x86). Then we can use a model that has been trained on the high-resource ISA to test the translated binary. We have implemented the model called UNSUPERBINTRANS, and conducted experiments to evaluate its performance. Specifically, we conducted two downstream tasks, including code similarity detection and vulnerability discovery. In both tasks, we achieved high accuracies.
Related papers
- How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - BiBench: Benchmarking and Analyzing Network Binarization [72.59760752906757]
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings.
Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood.
We present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization.
arXiv Detail & Related papers (2023-01-26T17:17:16Z) - Leveraging Artificial Intelligence on Binary Code Comprehension [5.236023714727536]
We propose to develop Artificial Intelligence (AI) models that aid human comprehension of binary code.
Specifically, we propose to incorporate domain knowledge from large corpora of source code (e.g., variable names, comments) to build AI models that capture a generalizable representation of binary code.
Lastly, we will investigate metrics to assess the performance of models that apply to binary code by using human studies of comprehension.
arXiv Detail & Related papers (2022-10-11T02:39:29Z) - Pre-Training Representations of Binary Code Using Contrastive Learning [14.1548548120994]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z) - Towards Accurate Binary Neural Networks via Modeling Contextual
Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function.
We present new designs of binary neural modules, which enables leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z) - A Natural Language Processing Approach for Instruction Set Architecture
Identification [6.495883501989546]
We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA.
Our approach results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures.
arXiv Detail & Related papers (2022-04-13T19:45:06Z) - Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z) - LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes [55.32790803903619]
We propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes.
Our method does not require any side-information, like annotated attributes or label meta-data.
We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes.
arXiv Detail & Related papers (2021-06-02T21:57:52Z) - BATS: Binary ArchitecTure Search [56.87581500474093]
We show that directly applying Neural Architecture Search to the binary domain provides very poor results.
Specifically, we introduce and design a novel binary-oriented search space.
We also set a new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and ImageNet datasets.
arXiv Detail & Related papers (2020-03-03T18:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.