Classification and Online Clustering of Zero-Day Malware
- URL: http://arxiv.org/abs/2305.00605v2
- Date: Thu, 3 Aug 2023 12:04:46 GMT
- Title: Classification and Online Clustering of Zero-Day Malware
- Authors: Olha Jurečková, Martin Jureček, Mark Stamp, Fabio Di Troia, Róbert Lórencz
- Abstract summary: This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them.
Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
- Score: 4.409836695738518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A large amount of new malware is constantly being generated, which must not
only be distinguished from benign samples, but also classified into malware
families. To this end, it is necessary to investigate how existing malware
families develop and to examine emerging families. This paper
focuses on the online processing of incoming malicious samples to assign them
to existing families or, in the case of samples from new families, to cluster
them. We experimented with seven prevalent malware families from the EMBER
dataset, four in the training set and three additional new families in the test
set. Based on the classification score of the multilayer perceptron, we
determined which samples would be classified and which would be clustered into
new malware families. We classified 97.21% of the streaming data with a balanced
accuracy of 95.33%. Then, we clustered the remaining data using a
self-organizing map, achieving a purity ranging from 47.61% for four clusters to 77.68%
for ten clusters. These results indicate that our approach has the potential to
be applied to the classification and clustering of zero-day malware into
malware families.
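The pipeline described in the abstract (route each sample by its classifier score, then evaluate the classified part with balanced accuracy and the clustered part with purity) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names and the threshold value 0.9 are hypothetical, the abstract does not state the cut-off used, and the self-organizing map itself is not shown; any clustering step that outputs cluster ids could feed `purity`.

```python
from collections import Counter


def route_by_score(scores, threshold=0.9):
    """Split samples by their highest class score.

    scores: list of per-class score lists (e.g. softmax outputs of an MLP).
    threshold: hypothetical cut-off, not taken from the paper.
    Returns (classified, to_cluster), where classified holds
    (sample_index, predicted_class) pairs and to_cluster holds sample indices.
    """
    classified, to_cluster = [], []
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            classified.append((i, best))   # confident: assign to a known family
        else:
            to_cluster.append(i)           # uncertain: pass on to clustering
    return classified, to_cluster


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)
```

For example, with three samples and four known families, `route_by_score([[0.97, 0.01, 0.01, 0.01], [0.4, 0.3, 0.2, 0.1], [0.05, 0.92, 0.02, 0.01]])` keeps the first and third samples for classification and sends the second to clustering. Purity tends to rise as the number of clusters grows, which is consistent with the increase from four to ten clusters reported above.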
Related papers
- Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics.
Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats.
This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restricted to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z) - CNS-Net: Conservative Novelty Synthesizing Network for Malware
Recognition in an Open-set Scenario [14.059646012441313]
We study the challenging task of malware recognition on both known and novel unknown malware families, called malware open-set recognition (MOSR).
In this paper, we propose a novel model that can conservatively synthesize malware instances to mimic unknown malware families.
We also build a new large-scale malware dataset, named MAL-100, to address the lack of a large open-set malware benchmark dataset.
arXiv Detail & Related papers (2023-05-02T07:31:42Z) - Clustering based opcode graph generation for malware variant detection [1.179179628317559]
We propose a methodology to perform malware detection and family attribution.
The proposed methodology first extracts opcodes from the malware samples in each family and constructs their respective opcode graphs.
We explore the use of clustering algorithms on the opcode graphs to detect clusters of malware samples within the same malware family.
arXiv Detail & Related papers (2022-11-18T06:12:33Z) - Cluster Analysis of Malware Family Relationships [4.111899441919165]
We consider a dataset comprising 20 malware families with 1000 samples per family.
We perform clustering based on pairs of families and use the results to determine relationships between families.
Our results indicate that $K$-means clustering can be a powerful tool for data exploration of malware family relationships.
arXiv Detail & Related papers (2021-03-07T14:51:01Z) - Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of
Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure.
We find clinically relevant clusters from an embedded space derived from heterogeneous data.
The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z) - Automatic sleep stage classification with deep residual networks in a
mixed-cohort setting [63.52264764099532]
We developed a novel deep neural network model to assess the generalizability of several large-scale cohorts.
Overall classification accuracy improved with increasing fractions of training data.
arXiv Detail & Related papers (2020-08-21T10:48:35Z) - LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation.
We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z) - Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.