RawMal-TF: Raw Malware Dataset Labeled by Type and Family
- URL: http://arxiv.org/abs/2506.23909v1
- Date: Mon, 30 Jun 2025 14:38:01 GMT
- Title: RawMal-TF: Raw Malware Dataset Labeled by Type and Family
- Authors: David Bálik, Martin Jureček, Mark Stamp,
- Abstract summary: This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels.<n>The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline.<n>In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets.
- Score: 1.2289361708127875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels. Raw binaries were collected from sources such as VirusShare, VX Underground, and MalwareBazaar, and subsequently labeled with family information parsed from binary names and type-level labels integrated from ClarAVy. The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline based on static analysis, particularly extracting features from Portable Executable headers, to support advanced classification tasks. The evaluation was focused on three key classification tasks. In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets, reaching 98.5% for type-based detection and 98.98% for family-based detection. When using truncated datasets of 1,000 samples to assess performance under limited data conditions, both models still performed strongly, achieving 97.6% for type-based detection and 98.66% for family-based detection. For interclass classification, which distinguishes between malware types or families, the models reached up to 97.5% accuracy on type-level tasks and up to 93.7% on family-level tasks. In the multiclass classification setting, which assigns samples to the correct type or family, SVM achieved 81.1% accuracy on type labels, while Random Forest and XGBoost reached approximately 73.4% on family labels. The results highlight practical trade-offs between accuracy and computational cost, and demonstrate that labeling at both the type and family levels enables more fine-grained and insightful malware classification. The work establishes a robust foundation for future research on advanced malware detection and classification.
Related papers
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z) - Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each)
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z) - Classification and Online Clustering of Zero-Day Malware [4.409836695738518]
This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them.
Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
arXiv Detail & Related papers (2023-05-01T00:00:07Z) - The Familiarity Hypothesis: Explaining the Behavior of Deep Open Set
Methods [86.39044549664189]
Anomaly detection algorithms for feature-vector data identify anomalies as outliers, but outlier detection has not worked well in deep learning.
This paper proposes the Familiarity Hypothesis that these methods succeed because they are detecting the absence of familiar learned features rather than the presence of novelty.
The paper concludes with a discussion of whether familiarity detection is an inevitable consequence of representation learning.
arXiv Detail & Related papers (2022-03-04T18:32:58Z) - New Datasets for Dynamic Malware Classification [0.0]
We introduce two new, updated datasets of malicious software, VirusSamples and VirusShare.
This paper analyzes multi-class malware classification performance of the balanced and imbalanced version of these two datasets.
Results show that Support Vector Machine, achieves the highest score of 94% in the imbalanced VirusSample dataset.
XGBoost, one of the most common gradient boosting-based models, achieves the highest score of 90% and 80%.in both versions of the VirusShare dataset.
arXiv Detail & Related papers (2021-11-30T08:31:16Z) - MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels [21.050311121388813]
We have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset.
MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset.
We provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools.
arXiv Detail & Related papers (2021-11-29T23:59:50Z) - Label-Assemble: Leveraging Multiple Datasets with Partial Labels [68.46767639240564]
"Label-Assemble" aims to unleash the full potential of partial labels from an assembly of public datasets.
We discovered that learning from negative examples facilitates both computer-aided disease diagnosis and detection.
arXiv Detail & Related papers (2021-09-25T02:48:17Z) - Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z) - Chest x-ray automated triage: a semiologic approach designed for
clinical implementation, exploiting different types of labels through a
combination of four Deep Learning architectures [83.48996461770017]
This work presents a Deep Learning method based on the late fusion of different convolutional architectures.
We built four training datasets combining images from public chest x-ray datasets and our institutional archive.
We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool.
arXiv Detail & Related papers (2020-12-23T14:38:35Z) - Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.