LFreeDA: Label-Free Drift Adaptation for Windows Malware Detection
- URL: http://arxiv.org/abs/2511.14963v1
- Date: Tue, 18 Nov 2025 23:08:26 GMT
- Title: LFreeDA: Label-Free Drift Adaptation for Windows Malware Detection
- Authors: Adrian Shuai Li, Elisa Bertino
- Abstract summary: This paper introduces LFreeDA, an end-to-end framework that adapts malware classifiers to drift without manual labeling or drift detection. LFreeDA first performs unsupervised domain adaptation on malware images, jointly training on labeled and unlabeled samples to infer pseudo-labels and prune noisy ones. Evaluations show that LFreeDA improves accuracy by up to 12.6% and F1 by 11.1% over no-adaptation lower bounds, and is only 4% and 3.4% below fully supervised upper bounds in accuracy and F1, respectively.
- Score: 9.054165392355877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML)-based malware detectors degrade over time as concept drift introduces new and evolving families unseen during training. Retraining is limited by the cost and time of manual labeling or sandbox analysis. Existing approaches mitigate this via drift detection and selective labeling, but fully label-free adaptation remains largely unexplored. Recent self-training methods use a previously trained model to generate pseudo-labels for unlabeled data and then train a new model on these labels. The unlabeled data are used only for inference and do not participate in training the earlier model. We argue that these unlabeled samples still carry valuable information that can be leveraged when incorporated appropriately into training. This paper introduces LFreeDA, an end-to-end framework that adapts malware classifiers to drift without manual labeling or drift detection. LFreeDA first performs unsupervised domain adaptation on malware images, jointly training on labeled and unlabeled samples to infer pseudo-labels and prune noisy ones. It then adapts a classifier on CFG representations using the labeled and selected pseudo-labeled data, leveraging the scalability of images for pseudo-labeling and the richer semantics of CFGs for final adaptation. Evaluations on the real-world MB-24+ dataset show that LFreeDA improves accuracy by up to 12.6% and F1 by 11.1% over no-adaptation lower bounds, and is only 4% and 3.4% below fully supervised upper bounds in accuracy and F1, respectively. It also matches the performance of state-of-the-art methods provided with ground truth labels for 300 target samples. Additional results on two controlled-drift benchmarks further confirm that LFreeDA maintains malware detection performance as malware evolves without human labeling.
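The pseudo-labeling and pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how noisy pseudo-labels are pruned, so the confidence threshold below is an assumption, and the names `select_pseudo_labels` and `build_adaptation_set` are hypothetical. The class probabilities stand in for the output of the image-based domain-adapted model; the merged set stands in for the data used to adapt the CFG classifier.

```python
def select_pseudo_labels(probs, threshold=0.9):
    """Return (index, label) pairs for confidently predicted target samples.

    Assumption: a fixed confidence threshold is used for pruning. The paper's
    actual pruning criterion is not given in the abstract.
    """
    kept = []
    for index, row in enumerate(probs):
        row = list(row)
        confidence = max(row)              # probability of the top class
        if confidence >= threshold:
            kept.append((index, row.index(confidence)))
    return kept


def build_adaptation_set(x_src, y_src, x_tgt, probs_tgt, threshold=0.9):
    """Merge labeled source data with the retained pseudo-labeled target data."""
    kept = select_pseudo_labels(probs_tgt, threshold)
    x = list(x_src) + [x_tgt[index] for index, _ in kept]
    y = list(y_src) + [label for _, label in kept]
    return x, y


# Toy example: two labeled source samples and three unlabeled target samples.
x_src = ["src_a", "src_b"]
y_src = [0, 1]
x_tgt = ["tgt_a", "tgt_b", "tgt_c"]
probs_tgt = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]

x_adapt, y_adapt = build_adaptation_set(x_src, y_src, x_tgt, probs_tgt)
print(x_adapt, y_adapt)
```

In this toy run the second target sample is dropped because its top-class probability falls below the assumed threshold of 0.9; a lower threshold keeps more target samples at the cost of noisier labels.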
Related papers
- Efficient Adaptive Label Refinement for Label Noise Learning [14.617885790129336]
We propose Adaptive Label Refinement (ALR) to avoid incorrect labels and thoroughly learn clean samples. ALR is simple and efficient, requiring no prior knowledge of noise or auxiliary datasets. We validate ALR's effectiveness through experiments on benchmark datasets with artificial label noise (CIFAR-10/100) and real-world datasets with inherent noise (ANIMAL-10N, Clothing1M, WebVision).
arXiv Detail & Related papers (2025-02-01T09:58:08Z) - Retraining with Predicted Hard Labels Provably Increases Model Accuracy [77.71162068832108]
Retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost.
arXiv Detail & Related papers (2024-06-17T04:53:47Z) - Uncertainty-Aware Pseudo-Label Filtering for Source-Free Unsupervised Domain Adaptation [45.53185386883692]
Source-free unsupervised domain adaptation (SFUDA) aims to enable the utilization of a pre-trained source model in an unlabeled target domain without access to source data.
We propose a method called Uncertainty-aware Pseudo-label-filtering Adaptation (UPA) to efficiently address this issue in a coarse-to-fine manner.
arXiv Detail & Related papers (2024-03-17T16:19:40Z) - Boosting Semi-Supervised Learning by bridging high and low-confidence
predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z) - Refined Pseudo labeling for Source-free Domain Adaptive Object Detection [9.705172026751294]
Source-free domain adaptive object detection is proposed to adapt source-trained detectors to target domains with only unlabeled target data.
Existing source-free methods typically utilize pseudo labeling, where performance relies heavily on the choice of confidence threshold.
We present a category-aware adaptive threshold estimation module, which adaptively provides the appropriate threshold for each category.
arXiv Detail & Related papers (2023-03-07T08:31:42Z) - Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z) - Uncertainty-aware Mean Teacher for Source-free Unsupervised Domain Adaptive 3D Object Detection [6.345037597566315]
Pseudo-label based self training approaches are a popular method for source-free unsupervised domain adaptation.
We propose an uncertainty-aware mean teacher framework which implicitly filters incorrect pseudo-labels during training.
arXiv Detail & Related papers (2021-09-29T18:17:09Z) - Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z) - Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z) - In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not have this constraint but performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z) - A Free Lunch for Unsupervised Domain Adaptive Object Detection without Source Data [69.091485888121]
Unsupervised domain adaptation assumes that source and target domain data are freely available and usually trained together to reduce the domain gap.
We propose a source data-free domain adaptive object detection (SFOD) framework via modeling it into a problem of learning with noisy labels.
arXiv Detail & Related papers (2020-12-10T01:42:35Z)
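
One selection rule mentioned in the list above, retraining only on samples whose predicted label matches the given label (from "Retraining with Predicted Hard Labels Provably Increases Model Accuracy"), can be sketched as follows. This is an illustration of the rule as summarized here, not that paper's implementation; the function name `consensus_subset` is hypothetical.

```python
def consensus_subset(samples, given_labels, predicted_labels):
    """Keep (sample, label) pairs where the model's prediction agrees with the given label."""
    if not (len(samples) == len(given_labels) == len(predicted_labels)):
        raise ValueError("samples, given_labels and predicted_labels must have equal length")
    return [
        (sample, given)
        for sample, given, predicted in zip(samples, given_labels, predicted_labels)
        if given == predicted
    ]


# Toy example: the second sample is dropped because the labels disagree.
retrain_set = consensus_subset(["a", "b", "c"], [0, 1, 1], [0, 0, 1])
print(retrain_set)
```

The retained pairs would then be used to retrain the model, on the premise that agreement between the given and predicted labels indicates a cleaner sample.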