Semantic-aware Binary Code Representation with BERT
- URL: http://arxiv.org/abs/2106.05478v1
- Date: Thu, 10 Jun 2021 03:31:29 GMT
- Title: Semantic-aware Binary Code Representation with BERT
- Authors: Hyungjoon Koo, Soyeon Park, Daejin Choi, Taesoo Kim
- Abstract summary: A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code.
Recently, machine-learning-based binary analysis techniques have been proposed to reconstruct the code representation of a binary automatically.
In this paper, we propose DeepSemantic, which utilizes BERT to produce a semantic-aware code representation of binary code.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A wide range of binary analysis applications, such as bug discovery, malware
analysis, and code clone detection, require recovering the contextual meaning of
binary code. Recently, machine-learning-based binary analysis techniques have
been proposed to reconstruct the code representation of a binary automatically,
instead of manually crafting the specifics of an analysis algorithm. However,
existing machine-learning approaches are still specialized to a single problem
domain, requiring a new model to be built for each type of binary analysis. In
this paper, we propose DeepSemantic, which utilizes BERT to produce a
semantic-aware code representation of binary code.
To this end, we introduce a well-balanced instruction normalization that retains
rich information for each instruction while minimizing the out-of-vocabulary
(OOV) problem. DeepSemantic has been carefully designed based on our study of
large swaths of binaries. Moreover, DeepSemantic leverages the essence of the
BERT architecture by re-purposing a pre-trained generic model, readily available
as a one-time process, and then quickly applying it to specific downstream tasks
through fine-tuning. We demonstrate DeepSemantic with two downstream tasks:
binary similarity comparison and compiler provenance (i.e., compiler and
optimization level) prediction. Our experimental results show that our binary
similarity model outperforms two state-of-the-art binary similarity tools,
DeepBinDiff and SAFE, by 49.84% and 15.83% on average, respectively.
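The abstract's "well-balanced instruction normalization" can be illustrated with a minimal sketch: raw x86 instructions are rewritten so that registers collapse to coarse size classes while immediates and memory operands each become a single token, shrinking the vocabulary without discarding the mnemonic and operand structure. The token names (`reg4`, `immval`, `memref`) and the rules below are illustrative assumptions, not the paper's actual normalization scheme.

```python
import re

# Hypothetical register classes (8-byte vs 4-byte x86 registers).
# The real DeepSemantic rules are more fine-grained than this sketch.
REG_CLASSES = {
    "rax": "reg8", "rbx": "reg8", "rcx": "reg8", "rdx": "reg8", "rbp": "reg8",
    "eax": "reg4", "ebx": "reg4", "ecx": "reg4", "edx": "reg4", "ebp": "reg4",
}

def normalize_operand(op: str) -> str:
    """Map one operand to a coarse token class to shrink the vocabulary."""
    op = op.strip()
    if op in REG_CLASSES:                              # registers -> size class
        return REG_CLASSES[op]
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):    # immediates -> one token
        return "immval"
    if "[" in op:                                      # memory refs -> one token
        return "memref"
    return op

def normalize_insn(insn: str) -> str:
    """Normalize one instruction, e.g. 'mov eax, 0x10' -> 'mov reg4, immval'."""
    parts = insn.split(None, 1)
    if len(parts) == 1:                                # no operands, e.g. 'ret'
        return insn
    mnemonic, rest = parts
    ops = [normalize_operand(o) for o in rest.split(",")]
    return f"{mnemonic} {', '.join(ops)}"

print(normalize_insn("mov eax, 0x10"))   # mov reg4, immval
print(normalize_insn("mov rax, [rbp]"))  # mov reg8, memref
```

Each normalized instruction then serves as one "word" for BERT-style pre-training; the balance lies in keeping enough operand detail to distinguish semantics while mapping the unbounded space of literal addresses and constants onto a small, closed vocabulary.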
Related papers
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
- Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis is an emerging problem.
Deep learning-based binary analysis has shown promising success.
For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- BiBench: Benchmarking and Analyzing Network Binarization [72.59760752906757]
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings.
Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood.
We present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization.
arXiv Detail & Related papers (2023-01-26T17:17:16Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Pre-Training Representations of Binary Code Using Contrastive Learning [13.570375923483452]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z)
- Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function.
We present new designs of binary neural modules that outperform leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
- Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks [15.780176500971244]
We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs.
We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks.
We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
arXiv Detail & Related papers (2020-02-09T15:46:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.