Cluster Analysis of Malware Family Relationships
- URL: http://arxiv.org/abs/2103.05761v1
- Date: Sun, 7 Mar 2021 14:51:01 GMT
- Title: Cluster Analysis of Malware Family Relationships
- Authors: Samanvitha Basole and Mark Stamp
- Abstract summary: We consider a dataset comprising 20 malware families with 1000 samples per family.
We perform clustering based on pairs of families and use the results to determine relationships between families.
Our results indicate that $K$-means clustering can be a powerful tool for data exploration of malware family relationships.
- Score: 4.111899441919165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we use $K$-means clustering to analyze various relationships
between malware samples. We consider a dataset comprising 20 malware families
with 1000 samples per family. These families can be categorized into seven
different types of malware. We perform clustering based on pairs of families
and use the results to determine relationships between families. We perform a
similar cluster analysis based on malware type. Our results indicate that
$K$-means clustering can be a powerful tool for data exploration of malware
family relationships.
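The pairwise idea in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the two-dimensional synthetic feature vectors, the choice of K = 2, the purity score, and all function names (`kmeans`, `purity`, `pair_score`, `synthetic_family`) are assumptions made only for illustration. The sketch pools the samples of two families, runs plain K-means, and checks how cleanly the clusters separate the families; a score near 1 suggests the pair is distinct, while a score near 0.5 suggests the pair is hard to tell apart.

```python
import random
from collections import Counter


def kmeans(points, k=2, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct samples
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


def purity(cluster_labels, family_labels):
    """Fraction of samples that belong to the majority family of their cluster."""
    total = 0
    for c in set(cluster_labels):
        fams = [f for cl, f in zip(cluster_labels, family_labels) if cl == c]
        total += Counter(fams).most_common(1)[0][1]
    return total / len(family_labels)


def synthetic_family(center, n, rng, spread=1.0):
    """Hypothetical stand-in for real malware features: Gaussian blobs."""
    return [tuple(rng.gauss(m, spread) for m in center) for _ in range(n)]


def pair_score(fam1, fam2):
    """Cluster the pooled samples of two families and score the separation."""
    points = fam1 + fam2
    families = [0] * len(fam1) + [1] * len(fam2)
    return purity(kmeans(points, k=2), families)


rng = random.Random(42)
fam_a = synthetic_family((0.0, 0.0), 100, rng)
fam_b = synthetic_family((10.0, 10.0), 100, rng)  # far from family A
fam_c = synthetic_family((0.5, 0.5), 100, rng)    # heavily overlaps family A

score_ab = pair_score(fam_a, fam_b)
score_ac = pair_score(fam_a, fam_c)
print("A vs B purity:", score_ab)
print("A vs C purity:", score_ac)
```

Repeating `pair_score` over every pair of families would give a table of separation scores, which is one plausible way to read off which families behave alike under clustering.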
Related papers
- Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics.
Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats.
This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z)
- Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z)
- Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each).
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
- Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis [69.07674653828565]
Machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances.
In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis.
We propose a family of regularization methods, NFL (doN't Forget your Language) to mitigate spurious correlations in text classification.
arXiv Detail & Related papers (2023-05-23T03:55:50Z)
- Classification and Online Clustering of Zero-Day Malware [4.409836695738518]
This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them.
Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
arXiv Detail & Related papers (2023-05-01T00:00:07Z)
- Clustering based opcode graph generation for malware variant detection [1.179179628317559]
We propose a methodology to perform malware detection and family attribution.
The proposed methodology first extracts opcodes from the malware samples in each family and constructs their respective opcode graphs.
We explore the use of clustering algorithms on the opcode graphs to detect clusters of malware samples within the same malware family.
arXiv Detail & Related papers (2022-11-18T06:12:33Z)
- Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations.
We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
arXiv Detail & Related papers (2022-04-13T05:30:53Z)
- ACTIVE: Augmentation-Free Graph Contrastive Learning for Partial Multi-View Clustering [52.491074276133325]
We propose an augmentation-free graph contrastive learning framework to solve the problem of partial multi-view clustering.
The proposed approach elevates instance-level contrastive learning and missing data inference to the cluster-level, effectively mitigating the impact of individual missing data on clustering.
arXiv Detail & Related papers (2022-03-01T02:32:25Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation.
We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.