Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
- URL: http://arxiv.org/abs/2309.06643v1
- Date: Tue, 12 Sep 2023 23:45:59 GMT
- Title: Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
- Authors: Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, Boian S. Alexandrov,
- Abstract summary: We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
- Score: 34.7994627734601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this paper, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
Related papers
- Multi-label Classification for Android Malware Based on Active Learning [7.599125552187342]
We propose MLCDroid, an ML-based multi-label classification approach that can directly indicate the existence of pre-defined malicious behaviors.
We compare the results of 70 algorithm combinations to evaluate the effectiveness (best at 73.3%).
This is the first multi-label Android malware classification approach intending to provide more information on fine-grained malicious behaviors.
arXiv Detail & Related papers (2024-10-09T01:09:24Z) - Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics.
Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats.
This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each)
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Anomaly Detection in Cybersecurity: Unsupervised, Graph-Based and
Supervised Learning Methods in Adversarial Environments [63.942632088208505]
Inherent to today's operating environment is the practice of adversarial machine learning.
In this work, we examine the feasibility of unsupervised learning and graph-based methods for anomaly detection.
We incorporate a realistic adversarial training mechanism when training our supervised models to enable strong classification performance in adversarial environments.
arXiv Detail & Related papers (2021-05-14T10:05:10Z) - Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z) - DAEMON: Dataset-Agnostic Explainable Malware Classification Using
Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z) - Provable tradeoffs in adversarially robust classification [96.48180210364893]
We develop and leverage new tools, including recent breakthroughs from probability theory on robust isoperimetry.
Our results reveal fundamental tradeoffs between standard and robust accuracy that grow when data is imbalanced.
arXiv Detail & Related papers (2020-06-09T09:58:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.