Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection
- URL: http://arxiv.org/abs/2511.22095v1
- Date: Thu, 27 Nov 2025 04:33:16 GMT
- Title: Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection
- Authors: Michael J. Bommarito,
- Abstract summary: Binary-30K is the first heterogeneous binary dataset designed for sequence-based models like transformers.<n>With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding.<n>The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k, providing an accessible resource for researchers, practitioners, and students alike.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k, providing an accessible resource for researchers, practitioners, and students alike.
Related papers
- OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.<n>ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z) - Cross-modal Retrieval Models for Stripped Binary Analysis [62.89251403093734]
BinSeek is the first two-stage cross-modal retrieval framework for stripped binary code analysis.<n>It consists of two models: BinSeekEmbedding is trained on large-scale dataset to learn the semantic relevance of the binary code.<n>BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation.
arXiv Detail & Related papers (2025-12-11T07:58:10Z) - Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis [0.0]
We introduce the Binary BPE tokenizer family, a set of cross-platform tokenizers for executables trained on a large corpus of binaries.<n>We release trained tokenizers with vocabularies of 4K, 8K, 16K, 32K, and 64K tokens, enabling both systematic scaling studies and practical deployment.
arXiv Detail & Related papers (2025-11-14T22:53:03Z) - Deepfake Detection that Generalizes Across Benchmarks [48.85953407706351]
The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment.<n>This work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders.<n>The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC.
arXiv Detail & Related papers (2025-08-08T12:03:56Z) - Assemblage: Automatic Binary Dataset Construction for Machine Learning [35.674339346299654]
Assemblage is a cloud-based distributed system that crawls, configures, and builds Windows PE binaries.
We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations.
arXiv Detail & Related papers (2024-05-07T04:10:01Z) - BiBench: Benchmarking and Analyzing Network Binarization [72.59760752906757]
Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings.
Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood.
We present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization.
arXiv Detail & Related papers (2023-01-26T17:17:16Z) - UniASM: Binary Code Similarity Detection without Fine-tuning [2.2329530239800035]
We propose a novel rich-semantic function representation technique to ensure the model captures the intricate nuances of binary code.<n>We introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks.<n>The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets.
arXiv Detail & Related papers (2022-10-28T14:04:57Z) - CrossedWires: A Dataset of Syntactically Equivalent but Semantically
Disparate Deep Learning Models [4.487528701954606]
We present a living dataset of models and hyperparameters, called CrossedWires, that exposes semantic differences between two popular deep learning frameworks: PyTorch and Denseflow.
The CrossedWires dataset provides an opportunity to study semantic differences between syntactically equivalent models across popular deep learning frameworks.
arXiv Detail & Related papers (2021-08-29T07:27:16Z) - Rethinking Architecture Design for Tackling Data Heterogeneity in
Federated Learning [53.73083199055093]
We show that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts.
Our experiments show that replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices.
arXiv Detail & Related papers (2021-06-10T21:04:18Z) - End to End Binarized Neural Networks for Text Classification [4.046236197219608]
We propose an end to end binarized neural network architecture for the intent classification task.
The proposed architecture achieves comparable to the state-of-the-art results on standard intent classification datasets.
arXiv Detail & Related papers (2020-10-11T11:21:53Z) - Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage the powerful CNN for images and propose a CNN-based deep architecture to learn text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z) - MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and
Architectures [61.73533544385352]
We propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data.
As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize heterogeneous tasks and architectures.
arXiv Detail & Related papers (2020-06-13T02:54:59Z) - Binarizing MobileNet via Evolution-based Searching [66.94247681870125]
We propose a use of evolutionary search to facilitate the construction and training scheme when binarizing MobileNet.
Inspired by one-shot architecture search frameworks, we manipulate the idea of group convolution to design efficient 1-Bit Convolutional Neural Networks (CNNs)
Our objective is to come up with a tiny yet efficient binary neural architecture by exploring the best candidates of the group convolution.
arXiv Detail & Related papers (2020-05-13T13:25:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.