Related papers: Cluster Analysis of Malware Family Relationships

Related papers

Clustering Malware at Scale: A First Full-Benchmark Study [0.0]
We evaluate malware clustering quality and establish the state-of-the-art on Bodmas and Ember - two large public benchmark malware datasets. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.
arXiv Detail & Related papers (2025-11-28T14:02:17Z)
Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering [51.11677202873771]
Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike the Euclidean distance of numerical attributes, the categorical attributes lack well-defined relationships of their possible values. This paper breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly revealing various cluster distributions.
arXiv Detail & Related papers (2025-11-12T06:57:24Z)
Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG [41.02368814412595]
Family-Specific String (FSS) features can be utilized in a manner similar to Retrieval-Augmented Generation (RAG) to facilitate family classification. We develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices.
arXiv Detail & Related papers (2025-07-05T14:36:13Z)
Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics. Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats. This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z)
Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process. With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z)
Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each). We train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis [69.07674653828565]
Machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances. In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis. We propose a family of regularization methods, NFL (doN't Forget your Language) to mitigate spurious correlations in text classification.
arXiv Detail & Related papers (2023-05-23T03:55:50Z)
Classification and Online Clustering of Zero-Day Malware [4.409836695738518]
This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
arXiv Detail & Related papers (2023-05-01T00:00:07Z)
Clustering based opcode graph generation for malware variant detection [1.179179628317559]
We propose a methodology to perform malware detection and family attribution. The proposed methodology first extracts opcodes from the malware samples in each family and constructs their respective opcode graphs. We explore the use of clustering algorithms on the opcode graphs to detect clusters of malware samples within the same malware family.
arXiv Detail & Related papers (2022-11-18T06:12:33Z)
Estimating Structural Disparities for Face Models [54.062512989859265]
In machine learning, disparity metrics are often defined by measuring the difference in the performance or outcome of a model, across different sub-populations. We explore performing such analysis on computer vision models trained on human faces, and on tasks such as face attribute prediction and affect estimation.
arXiv Detail & Related papers (2022-04-13T05:30:53Z)
ACTIVE: Augmentation-Free Graph Contrastive Learning for Partial Multi-View Clustering [52.491074276133325]
We propose an augmentation-free graph contrastive learning framework to solve the problem of partial multi-view clustering. The proposed approach elevates instance-level contrastive learning and missing data inference to the cluster-level, effectively mitigating the impact of individual missing data on clustering.
arXiv Detail & Related papers (2022-03-01T02:32:25Z)
Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. We provide implementable differentially private clustering algorithms that provide utility when the data is "easy." We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem. We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.