Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance
- URL: http://arxiv.org/abs/2307.14657v1
- Date: Thu, 27 Jul 2023 07:18:10 GMT
- Title: Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance
- Authors: Savino Dambra, Yufei Han, Simone Aonzo, Platon Kotzias, Antonino
Vitale, Juan Caballero, Davide Balzarotti, Leyla Bilge
- Abstract summary: We collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each).
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
- Score: 25.184668510417545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many studies have proposed machine-learning (ML) models for malware detection
and classification, reporting an almost-perfect performance. However, they
assemble ground-truth in different ways, use diverse static- and
dynamic-analysis techniques for feature extraction, and even differ on what
they consider a malware family. As a consequence, our community still lacks an
understanding of malware classification results: whether they are tied to the
nature and distribution of the collected dataset, to what extent the number of
families and samples in the training dataset influence performance, and how
well static and dynamic features complement each other.
This work sheds light on those open questions by investigating the key
factors influencing ML-based malware detection and classification. For this, we
collect the largest balanced malware dataset so far with 67K samples from 670
families (100 samples each), and train state-of-the-art models for malware
detection and family classification using our dataset. Our results reveal that
static features perform better than dynamic features, and that combining both
only provides marginal improvement over static features. We discover no
correlation between packing and classification accuracy, and find that missing
behaviors in dynamically-extracted features severely penalize their performance.
We also demonstrate how a larger number of families to classify makes the
classification harder, while a higher number of samples per family increases
accuracy. Finally, we find that models trained on a uniform distribution of
samples per family better generalize on unseen data.
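The balanced-dataset design described in the abstract (a fixed number of samples drawn uniformly from each family) can be sketched as follows; the function and variable names are illustrative and not taken from the paper's code:

```python
import random
from collections import defaultdict

def balance_by_family(samples, per_family=100, seed=0):
    """Draw an equal number of samples from each malware family.

    `samples` is a list of (sample_id, family) pairs; families with
    fewer than `per_family` samples are dropped, mirroring the paper's
    100-samples-per-family design.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)
    balanced = {}
    for family, ids in by_family.items():
        if len(ids) >= per_family:
            # sample() draws without replacement, so no duplicates
            balanced[family] = rng.sample(ids, per_family)
    return balanced

# Example: three hypothetical families, only two with enough samples.
pool = [(f"f1-{i}", "family1") for i in range(150)] \
     + [(f"f2-{i}", "family2") for i in range(120)] \
     + [(f"f3-{i}", "family3") for i in range(10)]
subset = balance_by_family(pool, per_family=100)
```

Dropping under-populated families, rather than oversampling them, is one simple way to obtain the uniform per-family distribution the paper finds beneficial for generalization.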
Related papers
- How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restricted to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- Nebula: Self-Attention for Dynamic Malware Analysis [14.710331873072146]
We introduce Nebula, a versatile, self-attention Transformer-based neural architecture that generalizes across different behavioral representations and formats.
We perform experiments on both malware detection and classification tasks, using three datasets acquired from different dynamic analysis platforms.
We showcase how self-supervised pre-training matches the performance of fully supervised models with only 20% of the training data.
arXiv Detail & Related papers (2023-09-19T09:24:36Z)
- Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose HNMFk, a novel hierarchical semi-supervised algorithm that can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions (a rejection option), which yields promising results in identifying novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Multifamily Malware Models [5.414308305392762]
We conduct experiments based on byte $n$-gram features to quantify the relationship between the generality of the training dataset and the accuracy of the corresponding machine learning models.
We find that neighborhood-based algorithms generalize surprisingly well, far outperforming the other machine learning techniques considered.
arXiv Detail & Related papers (2022-06-27T13:06:31Z)
- HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques [48.82319198853359]
HardVis is a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios.
Users can explore subsets of the data from different perspectives to choose appropriate sampling parameters.
The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case.
arXiv Detail & Related papers (2022-03-29T17:04:16Z)
- Data-Centric Machine Learning in the Legal Domain [0.2624902795082451]
This paper explores how changes in a data set influence the measured performance of a model.
Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact performance.
The observed effects are surprisingly pronounced, especially when the per-class performance is considered.
arXiv Detail & Related papers (2022-01-17T23:05:14Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning method for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
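Several of the papers above (HardVis and the resampling-strategy survey in particular) address class imbalance through oversampling and undersampling. A minimal sketch of random oversampling, assuming a simple list-based feature/label layout rather than any specific paper's implementation:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples until every class matches
    the majority-class count. A toy illustration of the oversampling
    strategy discussed above, not a specific paper's method.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # indices of this class in the original data
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)  # duplicate a random existing example
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced set: 4 "malware" examples, 1 "benign" example.
X = [[0], [1], [2], [3], [4]]
y = ["malware", "malware", "malware", "malware", "benign"]
Xb, yb = random_oversample(X, y)
```

Duplication-based oversampling balances class counts but adds no new information; undersampling instead discards majority-class examples, trading data volume for balance.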
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.