Revisiting Concept Drift in Windows Malware Detection: Adaptation to Real Drifted Malware with Minimal Samples
- URL: http://arxiv.org/abs/2407.13918v2
- Date: Thu, 19 Dec 2024 20:05:59 GMT
- Title: Revisiting Concept Drift in Windows Malware Detection: Adaptation to Real Drifted Malware with Minimal Samples
- Authors: Adrian Shuai Li, Arun Iyengar, Ashish Kundu, Elisa Bertino
- Abstract summary: We propose a new technique for detecting and classifying drifted malware.
It learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation.
Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies.
- Score: 10.352741619176383
- Abstract: In applying deep learning for malware classification, it is crucial to account for the prevalence of malware evolution, which can cause trained classifiers to fail on drifted malware. Existing solutions to address concept drift use active learning. They select new samples for analysts to label and then retrain the classifier with the new labels. Our key finding is that the current retraining techniques do not achieve optimal results. These techniques overlook that updating the model with scarce drifted samples requires learning features that remain consistent across pre-drift and post-drift data. The model should thus be able to disregard specific features that, while beneficial for the classification of pre-drift data, are absent in post-drift data, thereby preventing prediction degradation. In this paper, we propose a new technique for detecting and classifying drifted malware that learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation. We compare it with existing model retraining methods in active learning-based malware detection systems and other domain adaptation techniques from the vision domain. Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies in 2024. We also tested our approach in predicting multiple malware families drifted over time. A thorough evaluation shows that our approach outperforms the state-of-the-art approaches.
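The active-learning loop that the abstract attributes to existing solutions (select new samples for analysts to label, then retrain) is often implemented with uncertainty sampling. The following is a minimal sketch of that selection step only. It assumes a binary classifier that outputs a malware probability per sample. The function name `select_for_labeling`, the `budget` parameter, and the example probabilities are hypothetical, and the abstract does not say which selection strategy the compared systems use.

```python
from typing import List, Sequence


def select_for_labeling(malware_probs: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` samples whose predicted malware
    probability is closest to 0.5, i.e. the least confident predictions.

    Assumption: uncertainty sampling stands in for whatever selection
    strategy a given active-learning system actually uses.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Distance from 0.5 is 0.0 for the most uncertain prediction and
    # 0.5 for a fully confident one, so sort ascending by that distance.
    ranked = sorted(
        range(len(malware_probs)),
        key=lambda i: abs(malware_probs[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical classifier outputs for five new, unlabeled samples.
probs = [0.98, 0.52, 0.03, 0.41, 0.75]
picked = select_for_labeling(probs, budget=2)
print(picked)
```

The selected indices would be sent to analysts for labeling. The retraining step that follows is where the paper argues current techniques fall short, and it is not sketched here.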
Related papers
- Cluster Analysis and Concept Drift Detection in Malware [1.3812010983144798]
Concept drift refers to gradual or sudden changes in the properties of data that affect the accuracy of machine learning models.
We propose and analyze a clustering-based approach to detecting concept drift in the malware domain.
arXiv Detail & Related papers (2025-02-19T22:42:30Z)
- DREAM: Combating Concept Drift with Explanatory Detection and Adaptation in Malware Classification [15.912839650827589]
The rapid evolution of malware, especially with new families, can depress classification accuracy to near-random levels.
Previous research has primarily focused on detecting drift samples, relying on expert-led analysis and labeling for model retraining.
We introduce DREAM, a novel system designed to surpass the capabilities of existing drift detectors.
arXiv Detail & Related papers (2024-05-07T07:55:45Z)
- MORPH: Towards Automated Concept Drift Adaptation for Malware Detection [0.7499722271664147]
Concept drift is a significant challenge for malware detection.
Self-training has emerged as a promising approach to mitigate concept drift.
We propose MORPH -- an effective pseudo-label-based concept drift adaptation method.
arXiv Detail & Related papers (2024-01-23T14:25:43Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- Activate and Reject: Towards Safe Domain Generalization under Category Shift [71.95548187205736]
We study a practical problem of Domain Generalization under Category Shift (DGCS).
It aims to simultaneously detect unknown-class samples and classify known-class samples in the target domains.
Compared to prior DG works, we face two new challenges: 1) how to learn the concept of "unknown" during training with only source known-class samples, and 2) how to adapt the source-trained model to unseen environments.
arXiv Detail & Related papers (2023-10-07T07:53:12Z)
- Optimized Deep Learning Models for Malware Detection under Concept Drift [0.0]
We propose a model-agnostic protocol to improve a baseline neural network against drift.
We show the importance of feature reduction and training with the most recent validation set possible, and propose a loss function named Drift-Resilient Binary Cross-Entropy.
Our improved model shows promising results, detecting 15.2% more malware than a baseline model.
arXiv Detail & Related papers (2023-08-21T16:13:23Z)
- Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z)
- Continual Learning with Bayesian Model based on a Fixed Pre-trained Feature Extractor [55.9023096444383]
Current deep learning models are characterised by catastrophic forgetting of old knowledge when learning new classes.
Inspired by the process of learning new knowledge in human brains, we propose a Bayesian generative model for continual learning.
arXiv Detail & Related papers (2022-04-28T08:41:51Z)
- Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model.
Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses.
BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)
- Scalable Backdoor Detection in Neural Networks [61.39635364047679]
Deep learning models are vulnerable to Trojan attacks, where an attacker can install a backdoor during training time to make the resultant model misidentify samples contaminated with a small trigger patch.
We propose a novel trigger reverse-engineering based approach whose computational complexity does not scale with the number of labels, and is based on a measure that is both interpretable and universal across different network and patch types.
In experiments, we observe that our method achieves a perfect score in separating Trojaned models from pure models, which is an improvement over the current state-of-the-art method.
arXiv Detail & Related papers (2020-06-10T04:12:53Z)
- Exploring Optimal Deep Learning Models for Image-based Malware Variant Classification [3.8073142980733]
We study the impact of differences in deep learning models and the degree of transfer learning on the classification accuracy of malware variants.
We found that the highest classification accuracy was obtained by fine-tuning one of the latest deep learning models with a relatively low degree of transfer learning.
arXiv Detail & Related papers (2020-04-10T23:45:54Z)