Cluster Analysis and Concept Drift Detection in Malware
- URL: http://arxiv.org/abs/2502.14135v1
- Date: Wed, 19 Feb 2025 22:42:30 GMT
- Title: Cluster Analysis and Concept Drift Detection in Malware
- Authors: Aniket Mishra, Mark Stamp
- Abstract summary: Concept drift refers to gradual or sudden changes in the properties of data that affect the accuracy of machine learning models.
We propose and analyze a clustering-based approach to detecting concept drift in the malware domain.
- Score: 1.3812010983144798
- Abstract: Concept drift refers to gradual or sudden changes in the properties of data that affect the accuracy of machine learning models. In this paper, we address the problem of concept drift detection in the malware domain. Specifically, we propose and analyze a clustering-based approach to detecting concept drift. Using a subset of the KronoDroid dataset, malware samples are partitioned into temporal batches and analyzed using MiniBatch $K$-Means clustering. The silhouette coefficient is used as a metric to identify points in time where concept drift has likely occurred. To verify our drift detection results, we train learning models under three realistic scenarios, which we refer to as static training, periodic retraining, and drift-aware retraining. In each scenario, we consider four supervised classifiers, namely, Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest, and XGBoost. Experimental results demonstrate that drift-aware retraining guided by silhouette coefficient thresholding achieves classification accuracy far superior to static models and generally within 1% of periodic retraining, while requiring far less computation than periodic retraining. These results provide strong evidence that our clustering-based approach is effective at detecting concept drift, and they illustrate a practical, efficient, and fully automated approach to improving malware classification via concept drift detection.
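The detection and retraining pipeline described in the abstract can be illustrated compactly. Below is a minimal sketch, assuming pre-extracted feature matrices for each temporal batch; the toy data, cluster count, drift threshold, and the drop-based thresholding rule are illustrative assumptions rather than the paper's actual settings (Random Forest is one of the paper's four classifiers).

```python
# Minimal sketch of silhouette-based drift detection with drift-aware
# retraining. The toy data, cluster count k, drift threshold, and the
# drop-based thresholding rule are illustrative assumptions; the paper's
# exact settings may differ.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score

def silhouette_per_batch(batches, k=4):
    """Cluster each temporal batch with MiniBatch K-Means and return
    its silhouette coefficient."""
    scores = []
    for X, _ in batches:
        labels = MiniBatchKMeans(n_clusters=k, n_init=10,
                                 random_state=0).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    return scores

def drift_aware_retraining(batches, threshold=0.05):
    """Keep a static model until the silhouette coefficient drops sharply
    between consecutive batches, then retrain on the drifted batch."""
    scores = silhouette_per_batch(batches)
    X0, y0 = batches[0]
    model = RandomForestClassifier(random_state=0).fit(X0, y0)
    accuracies = []
    for t in range(1, len(batches)):
        X, y = batches[t]
        accuracies.append(model.score(X, y))       # evaluate before retraining
        if scores[t - 1] - scores[t] > threshold:  # likely concept drift at t
            model = RandomForestClassifier(random_state=0).fit(X, y)
    return accuracies

# Toy usage: six temporal batches of random 32-dimensional feature vectors
# with binary (benign/malware) labels.
rng = np.random.default_rng(0)
batches = [(rng.normal(size=(200, 32)), rng.integers(0, 2, 200))
           for _ in range(6)]
print(drift_aware_retraining(batches))
```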
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z) - Revisiting Concept Drift in Windows Malware Detection: Adaptation to Real Drifted Malware with Minimal Samples [10.352741619176383]
We propose a new technique for detecting and classifying drifted malware.
It learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation.
Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies.
arXiv Detail & Related papers (2024-07-18T22:06:20Z) - Methods for Generating Drift in Text Streams [49.3179290313959]
Concept drift is a frequent phenomenon in real-world datasets and corresponds to changes in data distribution over time.
This paper provides four textual drift generation methods to ease the production of datasets with labeled drifts.
Results show that classification performance degrades immediately after each drift, and that the incremental SVM is the fastest to run and to recover its previous performance level.
arXiv Detail & Related papers (2024-03-18T23:48:33Z) - MORPH: Towards Automated Concept Drift Adaptation for Malware Detection [0.7499722271664147]
Concept drift is a significant challenge for malware detection.
Self-training has emerged as a promising approach to mitigate concept drift.
We propose MORPH -- an effective pseudo-label-based concept drift adaptation method.
arXiv Detail & Related papers (2024-01-23T14:25:43Z) - Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatially quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z) - Uncovering Drift in Textual Data: An Unsupervised Method for Detecting and Mitigating Drift in Machine Learning Models [9.035254826664273]
Drift in machine learning refers to the phenomenon where the statistical properties of the data or context in which a model operates change over time, leading to a decrease in its performance.
In our proposed unsupervised drift detection method, we follow a two-step process. The first step encodes a sample of production data as the target distribution and the model training data as the reference distribution; a generic sketch of this reference-vs-target comparison appears after this list.
Our method also identifies the subset of production data that is the root cause of the drift.
The models retrained using these identified high drift samples show improved performance on online customer experience quality metrics.
arXiv Detail & Related papers (2023-09-07T16:45:42Z) - Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
arXiv Detail & Related papers (2023-03-30T03:22:52Z) - Autoregressive based Drift Detection Method [0.0]
We propose ADDM, a new concept drift detection method based on autoregressive models.
Our results show that ADDM outperforms state-of-the-art drift detection methods.
arXiv Detail & Related papers (2022-03-09T14:36:16Z) - Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for automatic speech recognition (ASR) based on connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z) - Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z) - DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift [12.579800289829963]
When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate.
We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process.
The algorithm is generic in its base learner and can be applied across a variety of supervised learning problems.
arXiv Detail & Related papers (2020-03-13T23:25:25Z)
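As referenced in the "Uncovering Drift in Textual Data" entry above, a common pattern is to compare a sample of production data (the target distribution) against the model's training data (the reference distribution). The sketch below illustrates that generic pattern with a per-feature two-sample Kolmogorov-Smirnov test; the test choice, toy data, and significance level are illustrative assumptions and not necessarily that paper's method.

```python
# Generic sketch of a reference-vs-target drift check: compare a sample of
# production data against the training data, feature by feature. The
# per-feature two-sample Kolmogorov-Smirnov test, the toy data, and alpha
# are illustrative assumptions, not necessarily that paper's method.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(reference, target, alpha=0.01):
    """Return indices of features whose production (target) distribution
    differs significantly from the training (reference) distribution."""
    return [j for j in range(reference.shape[1])
            if ks_2samp(reference[:, j], target[:, j]).pvalue < alpha]

# Toy usage: inject strong drift into feature 3 of the production sample.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 8))   # model training data
target = rng.normal(size=(1000, 8))      # sample of production data
target[:, 3] += 2.0                      # simulated drift in one feature
print(drifted_features(reference, target))  # typically prints [3]
```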
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.