Semantic-aware Binary Code Representation with BERT
- URL: http://arxiv.org/abs/2106.05478v1
- Date: Thu, 10 Jun 2021 03:31:29 GMT
- Title: Semantic-aware Binary Code Representation with BERT
- Authors: Hyungjoon Koo, Soyeon Park, Daejin Choi, Taesoo Kim
- Abstract summary: A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code.
Recently, machine-learning-based binary analysis techniques have been proposed to reconstruct the code representation of a binary automatically.
In this paper, we propose DeepSemantic, which utilizes BERT to produce a semantic-aware code representation of binary code.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A wide range of binary analysis applications, such as bug discovery, malware
analysis, and code clone detection, require recovering the contextual meaning of
binary code. Recently, machine-learning-based binary analysis techniques have
been proposed to reconstruct the code representation of a binary automatically,
instead of manually crafting the specifics of an analysis algorithm. However,
existing machine-learning approaches are still specialized to a single problem
domain, requiring a new model to be built for each type of binary analysis. In
this paper, we propose DeepSemantic, which utilizes BERT to produce a
semantic-aware code representation of binary code.
To this end, we introduce a well-balanced instruction normalization that retains
rich information for each instruction while minimizing the out-of-vocabulary
(OOV) problem. DeepSemantic has been carefully designed based on our study of
large swaths of binaries. Moreover, DeepSemantic leverages the essence of the
BERT architecture by re-purposing a pre-trained generic model, readily available
as a one-time process, and then quickly applying it to specific downstream tasks
through fine-tuning. We demonstrate DeepSemantic with two downstream tasks:
binary similarity comparison and compiler provenance (i.e., compiler and
optimization level) prediction. Our experimental results show that our binary
similarity model outperforms two state-of-the-art binary similarity tools,
DeepBinDiff and SAFE, by 49.84% and 15.83% on average, respectively.
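The abstract's "well-balanced instruction normalization" can be illustrated with a minimal sketch: raw x86 instructions are rewritten so that registers collapse to coarse size classes while immediates and memory operands each become a single token, shrinking the vocabulary without discarding the mnemonic and operand structure. The token names (`reg4`, `immval`, `memref`) and the rules below are illustrative assumptions, not the paper's actual normalization scheme.

```python
import re

# Hypothetical register classes (8-byte vs 4-byte x86 registers).
# The real DeepSemantic rules are more fine-grained than this sketch.
REG_CLASSES = {
    "rax": "reg8", "rbx": "reg8", "rcx": "reg8", "rdx": "reg8", "rbp": "reg8",
    "eax": "reg4", "ebx": "reg4", "ecx": "reg4", "edx": "reg4", "ebp": "reg4",
}

def normalize_operand(op: str) -> str:
    """Map one operand to a coarse token class to shrink the vocabulary."""
    op = op.strip()
    if op in REG_CLASSES:                              # registers -> size class
        return REG_CLASSES[op]
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):    # immediates -> one token
        return "immval"
    if "[" in op:                                      # memory refs -> one token
        return "memref"
    return op

def normalize_insn(insn: str) -> str:
    """Normalize one instruction, e.g. 'mov eax, 0x10' -> 'mov reg4, immval'."""
    parts = insn.split(None, 1)
    if len(parts) == 1:                                # no operands, e.g. 'ret'
        return insn
    mnemonic, rest = parts
    ops = [normalize_operand(o) for o in rest.split(",")]
    return f"{mnemonic} {', '.join(ops)}"

print(normalize_insn("mov eax, 0x10"))   # mov reg4, immval
print(normalize_insn("mov rax, [rbp]"))  # mov reg8, memref
```

Each normalized instruction then serves as one "word" for BERT-style pre-training; the balance lies in keeping enough operand detail to distinguish semantics while mapping the unbounded space of literal addresses and constants onto a small, closed vocabulary.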
Related papers
- Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z)
- BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
- Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis is an emerging problem.
Deep learning-based binary analysis has shown promising success.
For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- BiBench: Benchmarking and Analyzing Network Binarization [72.59760752906757]
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings.
Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood.
We present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization.
arXiv Detail & Related papers (2023-01-26T17:17:16Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Pre-Training Representations of Binary Code Using Contrastive Learning [13.570375923483452]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z)
- Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function.
We present new designs of binary neural modules that outperform leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
- Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks [15.780176500971244]
We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs.
We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks.
We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
arXiv Detail & Related papers (2020-02-09T15:46:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.