Related papers: Classification and Online Clustering of Zero-Day Malware

Classification and Online Clustering of Zero-Day Malware

URL: http://arxiv.org/abs/2305.00605v2
Date: Thu, 3 Aug 2023 12:04:46 GMT
Title: Classification and Online Clustering of Zero-Day Malware
Authors: Olha Jure\v{c}kov\'a, Martin Jure\v{c}ek, Mark Stamp, Fabio Di Troia, R\'obert L\'orencz
Abstract summary: This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
Score: 4.409836695738518
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.

Related papers

Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG [17.816068374958043]
Family-Specific String (FSS) features could be utilized in a manner similar to Retrieval-Augmented Generation (RAG) to facilitate Malware Family Classification (MFC)<n>We develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
arXiv Detail & Related papers (2025-07-05T14:36:13Z)
RawMal-TF: Raw Malware Dataset Labeled by Type and Family [1.2289361708127875]
This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels.<n>The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline.<n>In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets.
arXiv Detail & Related papers (2025-06-30T14:38:01Z)
Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics. Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats. This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z)
Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines. Academic research is often restrained to public datasets on the order of ten thousand samples. We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells. During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation. This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process. With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z)
CNS-Net: Conservative Novelty Synthesizing Network for Malware Recognition in an Open-set Scenario [14.059646012441313]
We study the challenging task of malware recognition on both known and novel unknown malware families, called malware open-set recognition (MOSR) In this paper, we propose a novel model that can conservatively synthesize malware instances to mimic unknown malware families. We also build a new large-scale malware dataset, named MAL-100, to fill the gap of lacking large open-set malware benchmark dataset.
arXiv Detail & Related papers (2023-05-02T07:31:42Z)
Clustering based opcode graph generation for malware variant detection [1.179179628317559]
We propose a methodology to perform malware detection and family attribution. The proposed methodology first performs the extraction of opcodes from malwares in each family and constructs their respective opcode graphs. We explore the use of clustering algorithms on the opcode graphs to detect clusters of malwares within the same malware family.
arXiv Detail & Related papers (2022-11-18T06:12:33Z)
Cluster Analysis of Malware Family Relationships [4.111899441919165]
We consider a dataset comprising20 malware families with1000 samples per family. We perform clustering based on pairs of families and use the results to determine relationships between families. Our results indicate that $K$-means clustering can be a powerful tool for data exploration of malware family relationships.
arXiv Detail & Related papers (2021-03-07T14:51:01Z)
Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure. We find clinically relevant clusters from an embedded space derived from heterogeneous data. The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z)
Automatic sleep stage classification with deep residual networks in a mixed-cohort setting [63.52264764099532]
We developed a novel deep neural network model to assess the generalizability of several large-scale cohorts. Overall classification accuracy improved with increasing fractions of training data.
arXiv Detail & Related papers (2020-08-21T10:48:35Z)
LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z)
Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units. The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.