The Impact of Train-Test Leakage on Machine Learning-based Android Malware Detection
- URL: http://arxiv.org/abs/2410.19364v1
- Date: Fri, 25 Oct 2024 08:04:01 GMT
- Title: The Impact of Train-Test Leakage on Machine Learning-based Android Malware Detection
- Authors: Guojun Liu, Doina Caragea, Xinming Ou, Sankardas Roy
- Abstract summary: We identify distinct Android apps that have identical or nearly identical app representations.
This leads to a data leakage problem that inflates a machine learning model's measured performance.
We propose a leak-aware scheme to construct a machine learning-based Android malware detector.
- Score: 6.9053043489744015
- Abstract: When machine learning is used for Android malware detection, an app needs to be represented in a numerical format for training and testing. We identify a widespread occurrence of distinct Android apps that have identical or nearly identical app representations. In particular, among app samples in the testing dataset, a significant percentage can have a representation identical or nearly identical to that of an app in the training dataset. This leads to a data leakage problem that inflates a machine learning model's performance as measured on the testing dataset. The data leakage could not only lead to overly optimistic perceptions of the machine learning models' ability to generalize beyond the data on which they are trained; in some cases it could also lead to qualitatively different conclusions being drawn from the research. We present two case studies to illustrate this impact. In the first, the data leakage inflated the performance results but did not change the researchers' overall conclusions in a qualitative way. In the second, the data leakage problem would have led to qualitatively different conclusions being drawn from the research. We further propose a leak-aware scheme for constructing a machine learning-based Android malware detector, and show that it improves the overall detection performance.
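To make the leakage check concrete, here is a minimal sketch, assuming each app has already been encoded as a feature vector; the function name and the 0.99 similarity threshold are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_leaked_test_apps(X_train, X_test, threshold=0.99):
    """Flag test apps whose representation is identical or nearly
    identical to some training app's. The threshold is illustrative."""
    sims = cosine_similarity(X_test, X_train)  # shape: (n_test, n_train)
    max_sim = sims.max(axis=1)                 # closest training app per test app
    return np.where(max_sim >= threshold)[0]   # indices of leaked test apps

# A leak-aware evaluation would then score the model only on the
# de-duplicated portion of the test set:
# leaked = find_leaked_test_apps(X_train, X_test)
# clean = np.setdiff1d(np.arange(len(X_test)), leaked)
# accuracy = model.score(X_test[clean], y_test[clean])
```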
Related papers
- A Survey of Malware Detection Using Deep Learning [6.349503549199403]
This paper investigates advances in malware detection on Windows, iOS, Android, and Linux using deep learning (DL).
We discuss the issues and the challenges in malware detection using DL classifiers.
We examine eight popular DL approaches on various datasets.
arXiv Detail & Related papers (2024-07-27T02:49:55Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often constrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
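A hedged sketch of one way such a benchmark could be constructed (an illustration of the general idea under the assumption of vectorized samples, not the paper's specific algorithm): reserve for testing the samples that are least similar to everything else in the pool, so near-duplicate pairs cannot straddle the train/test boundary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hard_split(X, test_fraction=0.2):
    """Illustrative 'harder' split: the most isolated samples go to the
    test set, so a model cannot score well merely by memorizing
    near-duplicates of its training points."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    dists, _ = nn.kneighbors(X)      # dists[:, 0] is the self-distance (0)
    isolation = dists[:, 1]          # distance to the nearest other sample
    order = np.argsort(-isolation)   # most isolated samples first
    n_test = int(test_fraction * len(X))
    return order[n_test:], order[:n_test]  # train indices, test indices
```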
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- Android Malware Detection with Unbiased Confidence Guarantees [1.6432632226868131]
We propose a machine learning dynamic analysis approach that provides provably valid confidence guarantees in each malware detection.
The proposed approach is based on a novel machine learning framework, called Conformal Prediction, combined with a random forests classifier.
We examine its performance on a large-scale dataset collected by installing 1866 malicious and 4816 benign applications on a real Android device.
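A minimal sketch of inductive conformal prediction on top of a random forest, assuming integer 0/1 labels; the calibration split and the 1 - probability nonconformity score below are common illustrative choices, not necessarily the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def conformal_p_values(X_train, y_train, X_cal, y_cal, X_test):
    """Inductive conformal prediction sketch. Returns one p-value per
    test sample and candidate label; a label is rejected at
    significance eps whenever its p-value falls below eps."""
    rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    # Nonconformity of each calibration sample w.r.t. its true label.
    cal_proba = rf.predict_proba(X_cal)
    cal_scores = 1.0 - cal_proba[np.arange(len(y_cal)), y_cal]
    # p-value: fraction of calibration scores at least as nonconforming.
    test_proba = rf.predict_proba(X_test)
    p = np.empty_like(test_proba)
    for label in range(test_proba.shape[1]):
        scores = 1.0 - test_proba[:, label]
        p[:, label] = (
            (cal_scores[None, :] >= scores[:, None]).sum(axis=1) + 1
        ) / (len(cal_scores) + 1)
    return p
```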
arXiv Detail & Related papers (2023-12-17T11:07:31Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each).
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
- An Outlier Exposure Approach to Improve Visual Anomaly Detection Performance for Mobile Robots [76.36017224414523]
We consider the problem of building visual anomaly detection systems for mobile robots.
Standard anomaly detection models are trained using large datasets composed only of non-anomalous data.
We tackle the problem of exploiting a small set of available anomalous samples to improve the performance of a Real-NVP anomaly detection model.
arXiv Detail & Related papers (2022-09-20T15:18:13Z)
- Multifamily Malware Models [5.414308305392762]
We conduct experiments based on byte $n$-gram features to quantify the relationship between the generality of the training dataset and the accuracy of the corresponding machine learning models.
We find that neighborhood-based algorithms generalize surprisingly well, far outperforming the other machine learning techniques considered.
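For reference, byte $n$-gram features of the kind used here are straightforward to compute; a minimal sketch follows (practical pipelines typically apply top-k feature selection afterwards, which is omitted):

```python
from collections import Counter

def byte_ngram_counts(path, n=2):
    """Count byte n-grams in a binary file (a minimal sketch; real
    pipelines usually keep only the top-k most frequent or most
    informative n-grams as model features)."""
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```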
arXiv Detail & Related papers (2022-06-27T13:06:31Z)
- On the impact of dataset size and class imbalance in evaluating machine-learning-based windows malware detection techniques [0.0]
Some researchers use smaller datasets; if dataset size significantly affects measured performance, this makes comparison of published results difficult.
The project identified two key objectives: to understand whether dataset size correlates with measured detector performance to an extent that prevents meaningful comparison of published results, and to understand whether the class balance of the dataset similarly affects the measured performance.
Results suggested that high accuracy scores don't necessarily translate to high real-world performance.
arXiv Detail & Related papers (2022-06-13T15:37:31Z)
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
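The core idea can be sketched as follows, restricted to the accuracy metric and assuming the BNN returns calibrated class probabilities; `bnn_predict_proba` is a hypothetical helper, not the paper's API:

```python
import numpy as np

def estimate_accuracy(model_under_test, bnn_predict_proba, X_unlabeled):
    """Label-free accuracy estimate (sketch): average, over unlabeled
    samples, the BNN's probability that the model-under-test's
    prediction is the true label. Predictions are assumed to be
    integer class indices; bnn_predict_proba is an assumed helper
    returning an (n_samples, n_classes) probability matrix."""
    preds = model_under_test.predict(X_unlabeled)
    proba = bnn_predict_proba(X_unlabeled)
    return proba[np.arange(len(preds)), preds].mean()
```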
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)