Bin2vec: Learning Representations of Binary Executable Programs for
Security Tasks
- URL: http://arxiv.org/abs/2002.03388v2
- Date: Sat, 22 May 2021 17:27:57 GMT
- Title: Bin2vec: Learning Representations of Binary Executable Programs for
Security Tasks
- Authors: Shushan Arakelyan, Sima Arasteh, Christophe Hauser, Erik Kline and
Aram Galstyan
- Abstract summary: We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs.
We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks.
We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
- Score: 15.780176500971244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tackling binary program analysis problems has traditionally implied manually
defining rules and heuristics, a tedious and time-consuming task for human
analysts. In order to improve automation and scalability, we propose an
alternative direction based on distributed representations of binary programs
with applicability to a number of downstream tasks. We introduce Bin2vec, a new
approach leveraging Graph Convolutional Networks (GCN) along with computational
program graphs in order to learn a high-dimensional representation of binary
executable programs. We demonstrate the versatility of this approach by using
our representations to solve two semantically different binary analysis tasks:
functional algorithm classification and vulnerability discovery. We compare the
proposed approach to our own strong baseline as well as published results and
demonstrate improvement over state-of-the-art methods for both tasks. We
evaluated Bin2vec on 49191 binaries for the functional algorithm classification
task, and on 30 different CWE-IDs including at least 100 CVE entries each for
the vulnerability discovery task. We set a new state-of-the-art result by
reducing the classification error by 40% compared to the source-code-based
inst2vec approach, while working on binary code. For almost every vulnerability
class in our dataset, our prediction accuracy is over 80% (and over 90% in
multiple classes).
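The abstract names Graph Convolutional Networks over program graphs but does not give the layer formulation. As a minimal sketch, assuming the commonly used symmetric-normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W) and mean pooling to obtain one vector per program, the idea can be illustrated as follows. The function names, the toy graph, and the weights are hypothetical and are not taken from the paper.

```python
import math


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj    -- n x n adjacency matrix (list of lists of 0/1)
    feats  -- n x d_in node feature matrix H
    weight -- d_in x d_out weight matrix W (learned in a real model)
    """
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    d_in, d_out = len(feats[0]), len(weight[0])
    # Aggregate neighbor features: norm @ feats.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(d_in)] for i in range(n)]
    # Linear transform followed by ReLU: max(0, agg @ weight).
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(d_in)))
             for o in range(d_out)] for i in range(n)]


def graph_embedding(node_feats):
    """Mean-pool node vectors into a single vector for the whole graph."""
    n = len(node_feats)
    return [sum(row[o] for row in node_feats) / n
            for o in range(len(node_feats[0]))]


# Toy 3-node path graph standing in for a program graph (hypothetical data).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[0.5, -0.2], [0.3, 0.8]]

hidden = gcn_layer(adj, feats, weight)
embedding = graph_embedding(hidden)
print(embedding)
```

In a trained model the weight matrix would be learned, several such layers would typically be stacked, and the pooled vector would feed a classifier for a downstream task such as algorithm classification or vulnerability discovery.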
Related papers
- BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
- BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching [8.655595404611821]
We introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features.
Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task.
arXiv Detail & Related papers (2024-01-20T07:57:57Z)
- PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model [25.014876893315208]
We propose a new method to represent binary program semantics.
It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries.
Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings.
arXiv Detail & Related papers (2023-08-29T17:20:35Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions.
In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
- Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with a binarization function.
We present new designs of binary neural modules that outperform leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
- Benchmarking Node Outlier Detection on Graphs [90.29966986023403]
Graph outlier detection is an emerging but crucial machine learning task with numerous applications.
We present the first comprehensive unsupervised node outlier detection benchmark for graphs called UNOD.
arXiv Detail & Related papers (2022-06-21T01:46:38Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
- Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions, so that they learn in their task specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve the learning performance on each of the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.