On the Limitations of Continual Learning for Malware Classification
- URL: http://arxiv.org/abs/2208.06568v1
- Date: Sat, 13 Aug 2022 04:23:19 GMT
- Title: On the Limitations of Continual Learning for Malware Classification
- Authors: Mohammad Saidur Rahman, Scott E. Coull, Matthew Wright
- Abstract summary: We study 11 CL techniques applied to three malware tasks covering common incremental learning scenarios.
We evaluate the performance of the CL methods on both binary malware classification (Domain-IL) and multi-class malware family classification (Task-IL and Class-IL) tasks.
- Score: 18.567946765007658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Malicious software (malware) classification offers a unique challenge for
continual learning (CL) regimes due to the volume of new samples received on a
daily basis and the evolution of malware to exploit new vulnerabilities. On a
typical day, antivirus vendors receive hundreds of thousands of unique pieces
of software, both malicious and benign, and over the course of the lifetime of
a malware classifier, more than a billion samples can easily accumulate. Given
the scale of the problem, sequential training using continual learning
techniques could provide substantial benefits in reducing training and storage
overhead. To date, however, there has been no exploration of CL applied to
malware classification tasks. In this paper, we study 11 CL techniques applied
to three malware tasks covering common incremental learning scenarios,
including task, class, and domain incremental learning (IL). Specifically,
using two realistic, large-scale malware datasets, we evaluate the performance
of the CL methods on both binary malware classification (Domain-IL) and
multi-class malware family classification (Task-IL and Class-IL) tasks. To our
surprise, continual learning methods significantly underperformed naive Joint
replay of the training data in nearly all settings -- in some cases reducing
accuracy by more than 70 percentage points. A simple approach of selectively
replaying 20% of the stored data achieves better performance, with 50% of the
training time compared to Joint replay. Finally, we discuss potential reasons
for the unexpectedly poor performance of the CL techniques, with the hope that
it spurs further research on developing techniques that are more effective in
the malware classification domain.
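The abstract contrasts Joint replay (retraining on all stored data) with replaying only 20% of it. The following is a minimal sketch of that idea and not the authors' implementation. It assumes uniform random selection, since the abstract does not say how samples are chosen. The names `build_training_set` and `replay_fraction`, and the toy data, are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# A sample is a (feature vector, label) pair; this layout is an assumption.
Sample = Tuple[Sequence[float], int]


def build_training_set(
    stored: Sequence[Sample],
    new_batch: Sequence[Sample],
    replay_fraction: float = 0.2,
    seed: int = 0,
) -> List[Sample]:
    """Combine the new batch with a random subset of previously stored samples.

    replay_fraction=1.0 corresponds to Joint replay (all stored data is reused);
    replay_fraction=0.2 reuses 20% of it. Uniform sampling is an assumption here,
    not the selection rule from the paper.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be between 0 and 1")
    rng = random.Random(seed)
    k = round(len(stored) * replay_fraction)
    replayed = rng.sample(list(stored), k)  # sampling without replacement
    return replayed + list(new_batch)


# Toy data: 1000 stored samples and 100 newly received ones.
stored = [([float(i)], i % 2) for i in range(1000)]
new_batch = [([float(i)], i % 2) for i in range(1000, 1100)]

partial = build_training_set(stored, new_batch, replay_fraction=0.2)
joint = build_training_set(stored, new_batch, replay_fraction=1.0)
print(len(partial), len(joint))
```

In this toy setup the partial-replay training set holds 300 samples against 1100 for Joint replay, which illustrates where the training-time savings could come from. The accuracy trade-off depends on the selection strategy and the data.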
Related papers
- MADAR: Efficient Continual Learning for Malware Analysis with Diversity-Aware Replay [21.54671696689243]
Continual learning holds the potential to reduce the storage and computational costs of regularly retraining over all the collected data.
We propose MADAR, a CL framework that accounts for the unique properties and challenges of the malware data distribution.
arXiv Detail & Related papers (2025-02-09T03:37:48Z)
- MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification [1.9961741493139218]
Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats.
We introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples.
Our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%.
arXiv Detail & Related papers (2025-01-02T07:15:31Z)
- MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning [8.724680868086626]
MalMixer is a semi-supervised malware family classifier that achieves high accuracy with sparse training data.
We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings.
arXiv Detail & Related papers (2024-09-20T04:50:49Z)
- A Survey of Malware Detection Using Deep Learning [6.349503549199403]
This paper investigates advances in malware detection on Windows, iOS, Android, and Linux using deep learning (DL).
We discuss the issues and the challenges in malware detection using DL classifiers.
We examine eight popular DL approaches on various datasets.
arXiv Detail & Related papers (2024-07-27T02:49:55Z)
- Class-Incremental Learning: A Survey [84.30083092434938]
Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally.
CIL tends to catastrophically forget the characteristics of former ones, and its performance drastically degrades.
We provide a rigorous and unified evaluation of 17 methods in benchmark image classification tasks to find out the characteristics of different algorithms.
arXiv Detail & Related papers (2023-02-07T17:59:05Z)
- LifeLonger: A Benchmark for Continual Disease Classification [59.13735398630546]
We introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection.
Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch.
Cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge.
arXiv Detail & Related papers (2022-04-12T12:25:05Z)
- vCLIMB: A Novel Video Class Incremental Learning Benchmark [53.90485760679411]
We introduce vCLIMB, a novel video continual learning benchmark.
vCLIMB is a standardized test-bed to analyze catastrophic forgetting of deep models in video continual learning.
We propose a temporal consistency regularization that can be applied on top of memory-based continual learning methods.
arXiv Detail & Related papers (2022-01-23T22:14:17Z)
- A Comprehensive Study on Learning-Based PE Malware Family Classification Methods [9.142578100395909]
Portable Executable (PE) malware has been consistently evolving in terms of both volume and sophistication.
Three mainstream approaches that use learning-based algorithms, as categorized by the input format the methods take, are image-based, binary-based and disassembly-based approaches.
In this work, we conduct a thorough empirical study on learning-based PE malware classification approaches on 4 different datasets and consistent experiment settings.
arXiv Detail & Related papers (2021-10-29T05:32:28Z)
- Deep Learning and Traffic Classification: Lessons learned from a commercial-grade dataset with hundreds of encrypted and zero-day applications [72.02908263225919]
We share our experience on a commercial-grade DL traffic classification engine.
We identify known applications from encrypted traffic, as well as unknown zero-day applications.
We propose a novel technique, tailored for DL models, that is significantly more accurate and light-weight than the state of the art.
arXiv Detail & Related papers (2021-04-07T15:21:22Z)
- Incremental Embedding Learning via Zero-Shot Translation [65.94349068508863]
Current state-of-the-art incremental learning methods tackle the catastrophic forgetting problem in traditional classification networks.
We propose a novel class-incremental method for embedding networks, named the zero-shot translation class-incremental method (ZSTCI).
In addition, ZSTCI can easily be combined with existing regularization-based incremental learning methods to further improve performance of embedding networks.
arXiv Detail & Related papers (2020-12-31T08:21:37Z)
- Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
Adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes.
We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks.
These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)