Related papers: Assemblage: Automatic Binary Dataset Construction for Machine Learning

Assemblage: Automatic Binary Dataset Construction for Machine Learning

URL: http://arxiv.org/abs/2405.03991v2
Date: Sat, 02 Nov 2024 21:13:59 GMT
Title: Assemblage: Automatic Binary Dataset Construction for Machine Learning
Authors: Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski,
Abstract summary: Assemblage is a cloud-based distributed system that crawls, configures, and builds Windows PE binaries. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations.
Score: 35.674339346299654
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage code is open sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net

Related papers

An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems [14.826593801448032]
PLC-BEAD is a dataset containing 2431 compiled binaries from 700+ PLC programs across four major industrial compilers. This novel dataset uniquely pairs each binary with its original Structured Text source code and standardized functionality labels. We demonstrate the dataset's utility through PLCEmbed, a transformer-based framework for binary code analysis.
arXiv Detail & Related papers (2025-02-27T03:27:37Z)
Levels of Binary Equivalence for the Comparison of Binaries from Alternative Builds [1.1405827621489222]
Build platform variability can strengthen security as it facilitates the detection of compromised build environments. The availability of multiple binaries built from the same sources creates new challenges and opportunities. To answer such questions requires a notion of equivalence between binaries.
arXiv Detail & Related papers (2024-10-11T00:16:26Z)
Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code [4.956066467858057]
This research explores vulnerability detection using natural language processing (NLP) embedding techniques with word2vec, BERT, and RoBERTa. Long short-term memory (LSTM) neural networks were trained on embeddings from encoders created using approximately 48k LLVM functions from the Juliet dataset.
arXiv Detail & Related papers (2024-05-31T03:57:19Z)
Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis has become an emerging problem. Deep learning-based binary analysis has shown promising success. For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z)
How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching [8.655595404611821]
We introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task.
arXiv Detail & Related papers (2024-01-20T07:57:57Z)
BiBench: Benchmarking and Analyzing Network Binarization [72.59760752906757]
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. We present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization.
arXiv Detail & Related papers (2023-01-26T17:17:16Z)
BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications. We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to the real-network accuracy performance. We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function. We present new designs of binary neural modules, which enables leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary. In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving [94.40107679615618]
This paper proposes a novel binarized driveable area detection network (binary DAD-Net) It uses only binary weights and activations in the encoder, the bottleneck, and the decoder part. It outperforms state-of-the-art semantic segmentation networks on public datasets.
arXiv Detail & Related papers (2020-06-15T07:09:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.