Related papers: Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks

URL: http://arxiv.org/abs/2002.03388v2
Date: Sat, 22 May 2021 17:27:57 GMT
Title: Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks
Authors: Shushan Arakelyan, Sima Arasteh, Christophe Hauser, Erik Kline and Aram Galstyan
Abstract summary: We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
Score: 15.780176500971244
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tackling binary program analysis problems has traditionally implied manually defining rules and heuristics, a tedious and time-consuming task for human analysts. In order to improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs with applicability to a number of downstream tasks. We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs in order to learn a high-dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks: functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results and demonstrate improvement over state-of-the-art methods for both tasks. We evaluated Bin2vec on 49191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs including at least 100 CVE entries each for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).

Related papers

An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code [55.493408628371235]
We propose ByteTR, a framework for recovering variable types in binary code. In light of the ubiquity of variable propagation across functions, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery.
arXiv Detail & Related papers (2025-03-10T12:27:05Z)
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking [54.354203142828084]
We present the task of equivalence checking as a new way to evaluate the code reasoning abilities of large language models. We introduce EquiBench, a dataset of 2400 program pairs spanning four programming languages and six equivalence categories. Our evaluation of 17 state-of-the-art LLMs shows that OpenAI o3-mini achieves the highest overall accuracy of 78.0%.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
Training Multi-Layer Binary Neural Networks With Local Binary Error Signals [3.7740044597960316]
Binary Neural Networks (BNNs) reduce computational and memory usage in machine and deep learning by representing activations with just one bit. Most existing training algorithms for BNNs rely on floating-point Stochastic Gradient Descent (SGD), limiting the full exploitation of binary operations. We propose for the first time a fully binary and gradient-free algorithm for training BNNs.
arXiv Detail & Related papers (2024-11-28T09:12:04Z)
BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB. Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets. The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z)
Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets [0.7646713951724013]
This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets. A specialized graph neural network model operates on this graph representation, learning to map it to a feature vector that encodes semantic binary code similarities. Experimental results show that the combination of call graphlets and the novel graph neural network architecture achieves comparable or state-of-the-art performance.
arXiv Detail & Related papers (2024-06-02T18:26:50Z)
PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model [25.014876893315208]
We propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. Our evaluation on 9 real-world projects with 35k functions and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings.
arXiv Detail & Related papers (2023-08-29T17:20:35Z)
Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks. The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
UniASM: Binary Code Similarity Detection without Fine-tuning [0.8271859911016718]
We propose a novel transformer-based binary code embedding model named UniASM to learn representations of the binary functions. In the real-world task of known vulnerability search, UniASM outperforms all the current baselines.
arXiv Detail & Related papers (2022-10-28T14:04:57Z)
Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with a binarization function. We present new designs of binary neural modules that outperform leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
Benchmarking Node Outlier Detection on Graphs [90.29966986023403]
Graph outlier detection is an emerging but crucial machine learning task with numerous applications. We present the first comprehensive unsupervised node outlier detection benchmark for graphs called UNOD.
arXiv Detail & Related papers (2022-06-21T01:46:38Z)
Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovering the contextual meaning of binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary. In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks. In our novel formulation, we couple the parameters of these functions, so that they learn in their task-specific domains while staying close to each other. This facilitates cross-fertilization, in which data collected across different domains help improve the learning performance on each of the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph. It is updated by decoding in the context of an auto-encoder. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.