Related papers: Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods

Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods

URL: http://arxiv.org/abs/2301.12778v3
Date: Mon, 26 Aug 2024 07:19:33 GMT
Title: Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods
Authors: Ali Muzaffar, Hani Ragab Hassen, Hind Zantout, Michael A Lones,
Abstract summary: We show that high detection accuracies can be achieved using features extracted through static analysis alone. Random forests are generally the most effective model, outperforming more complex deep learning approaches.
Score: 2.9248916859490173
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The popularity of Android means it is a common target for malware. Over the years, various studies have found that machine learning models can effectively discriminate malware from benign applications. However, as the operating system evolves, so does malware, bringing into question the findings of these previous studies, many of which report very high accuracies using small, outdated, and often imbalanced datasets. In this paper, we reimplement 18 representative past works and reevaluate them using a balanced, relevant, and up-to-date dataset comprising 124,000 applications. We also carry out new experiments designed to fill holes in existing knowledge, and use our findings to identify the most effective features and models to use for Android malware detection within a contemporary environment. We show that high detection accuracies (up to 96.8%) can be achieved using features extracted through static analysis alone, yielding a modest benefit (1%) from using far more expensive dynamic analysis. API calls and opcodes are the most productive static and TCP network traffic provide the most predictive dynamic features. Random forests are generally the most effective model, outperforming more complex deep learning approaches. Whilst directly combining static and dynamic features is generally ineffective, ensembling models separately leads to performances comparable to the best models but using less brittle features.

Related papers

Imbalanced malware classification: an approach based on dynamic classifier selection [0.0]
A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications.
arXiv Detail & Related papers (2025-03-30T19:12:16Z)
Benchmarking Android Malware Detection: Rethinking the Role of Traditional and Deep Learning Models [6.9053043489744015]
Android malware detection has been extensively studied using both traditional machine learning (ML) and deep learning (DL) approaches. While many state-of-the-art detection models claim superior performance, they often rely on limited comparisons. This raises concerns about the robustness of DL-based approaches' performance and the potential oversight of simpler, more efficient ML models.
arXiv Detail & Related papers (2025-02-20T20:56:05Z)
Revisiting Static Feature-Based Android Malware Detection [0.8192907805418583]
This paper highlights critical pitfalls that undermine the validity of machine learning research in Android malware detection. We propose solutions for improving datasets and methodological practices, enabling fairer model comparisons. Our paper aims to support future research in Android malware detection and other security domains, enhancing the reliability and validity of published results.
arXiv Detail & Related papers (2024-09-11T16:37:50Z)
PromptSAM+: Malware Detection based on Prompt Segment Anything Model [8.00932560688061]
We propose a visual malware general enhancement classification framework, PromptSAM+', based on a large visual network segmentation model. Our experimental results indicate that 'PromptSAM+' is effective and efficient in malware detection and classification, achieving high accuracy and low rates of false positives and negatives.
arXiv Detail & Related papers (2024-08-04T15:42:34Z)
AppPoet: Large Language Model based Android malware detection via multi-view prompt engineering [1.3197408989895103]
AppPoet is a multi-view system for Android malware detection. Our method achieves a detection accuracy of 97.15% and an F1 score of 97.21%, which is superior to the baseline methods.
arXiv Detail & Related papers (2024-04-29T15:52:45Z)
Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines. Academic research is often restrained to public datasets on the order of ten thousand samples. We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
Malicious code detection in android: the role of sequence characteristics and disassembling methods [0.0]
We investigate and emphasize the factors that may affect the accuracy values of the models managed by researchers. Our findings exhibit that the disassembly method and different input representations affect the model results.
arXiv Detail & Related papers (2023-12-02T11:55:05Z)
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF) It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications. We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data. Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z)
Malware Classification Using Static Disassembly and Machine Learning [1.5469452301122177]
We propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content, and import libraries, to classify malware families. Compared with detailed behavior-related features like API sequences, proposed features provide macroscopic information about malware. We show that the novel proposed features together with a classical machine learning algorithm (Random Forest) presents very good accuracy at 99.40%.
arXiv Detail & Related papers (2021-12-10T18:14:47Z)
Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data. The idea is to estimate the metrics of interest for a model-under-test using Bayesian neural network (BNN)
arXiv Detail & Related papers (2021-04-11T12:14:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.