TESSERACT: Eliminating Experimental Bias in Malware Classification
across Space and Time (Extended Version)
- URL: http://arxiv.org/abs/2402.01359v1
- Date: Fri, 2 Feb 2024 12:27:32 GMT
- Title: TESSERACT: Eliminating Experimental Bias in Malware Classification
across Space and Time (Extended Version)
- Authors: Zeliang Kan, Shae McFadden, Daniel Arp, Feargus Pendlebury, Roberto
Jordaney, Johannes Kinder, Fabio Pierazzi, Lorenzo Cavallaro
- Abstract summary: Malware detectors often experience performance decay due to constantly evolving operating systems and attack methods.
This paper argues that commonly reported results are inflated due to two pervasive sources of experimental bias in the detection task.
- Score: 18.146377453918724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) plays a pivotal role in detecting malicious software.
Despite the high F1-scores reported in numerous studies, often upwards of
0.99, the issue is not completely solved. Malware detectors often experience
performance decay due to constantly evolving operating systems and attack
methods, which can render previously learned knowledge insufficient for
accurate decision-making on new inputs. This paper argues that commonly
reported results are inflated due to two pervasive sources of experimental bias
in the detection task: spatial bias caused by data distributions that are not
representative of a real-world deployment; and temporal bias caused by
incorrect time splits of data, leading to unrealistic configurations. To
address these biases, we introduce a set of constraints for fair experiment
design, and propose a new metric, AUT, for classifier robustness in real-world
settings. We additionally propose an algorithm designed to tune training data
to enhance classifier performance. Finally, we present TESSERACT, an
open-source framework for realistic classifier comparison. Our evaluation
encompasses both traditional ML and deep learning methods, examining published
works on an extensive Android dataset with 259,230 samples over a five-year
span. Additionally, we conduct case studies in the Windows PE and PDF domains.
Our findings identify the existence of biases in previous studies and reveal
that significant performance enhancements are possible through appropriate,
periodic tuning. We explore how mitigation strategies may support in achieving
a more stable and better performance over time by employing multiple strategies
to delay performance decay.
Related papers
- Evaluating the Efficacy of Instance Incremental vs. Batch Learning in Delayed Label Environments: An Empirical Study on Tabular Data Streaming for Fraud Detection [0.13980986259786224]
In real-world scenarios such as fraud detection or credit scoring, labels may be delayed.
Batch incremental algorithms are widely used in many real-world tasks.
Our findings indicate that instance incremental learning is not the superior option.
arXiv Detail & Related papers (2024-09-16T09:20:01Z) - Model Debiasing by Learnable Data Augmentation [19.625915578646758]
This paper proposes a novel two-stage learning pipeline featuring a data augmentation strategy that regularizes training.
Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods.
arXiv Detail & Related papers (2024-08-09T09:19:59Z) - A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces ADer, a comprehensive visual anomaly detection benchmark built as a modular framework that is easily extensible to new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z) - Adaptive Retention & Correction for Continual Learning [114.5656325514408]
A common problem in continual learning is the classification layer's bias towards the most recent task.
We name our approach Adaptive Retention & Correction (ARC).
ARC achieves average performance increases of 2.7% and 2.6% on the CIFAR-100 and ImageNet-R datasets, respectively.
arXiv Detail & Related papers (2024-05-23T08:43:09Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - DELTA: degradation-free fully test-time adaptation [59.74287982885375]
We find that two defects are concealed in prevalent adaptation methodologies such as test-time batch normalization (BN) and self-learning.
First, we reveal that the normalization statistics in test-time BN are derived entirely from the currently received test samples, resulting in inaccurate estimates.
Second, we show that during test-time adaptation, the parameter update is biased towards some dominant classes.
arXiv Detail & Related papers (2023-01-30T15:54:00Z) - Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time [69.77704012415845]
Temporal shifts can considerably degrade the performance of machine learning models deployed in the real world.
We benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data.
arXiv Detail & Related papers (2022-11-25T17:07:53Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z) - Robust Fairness-aware Learning Under Sample Selection Bias [17.09665420515772]
We propose a framework for robust and fair learning under sample selection bias.
We develop two algorithms to handle sample selection bias in the cases where test data is available and where it is unavailable.
arXiv Detail & Related papers (2021-05-24T23:23:36Z)