Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020
- URL: http://arxiv.org/abs/2602.08744v1
- Date: Mon, 09 Feb 2026 14:47:47 GMT
- Title: Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020
- Authors: Diego Ferreira Duarte, Andre Augusto Bortoli
- Abstract summary: This work tests Machine Learning algorithms in detecting malicious code from dynamic execution characteristics. In 75% of the tested configurations, the application of SMOTE led to performance degradation or only marginal improvements. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malware, malicious software designed to damage computer systems and perpetrate scams, is proliferating at an alarming rate, with thousands of new threats emerging daily. Android devices, prevalent in smartphones, smartwatches, tablets, and IoTs, represent a vast attack surface, making malware detection crucial. Although advanced analysis techniques exist, Machine Learning (ML) emerges as a promising tool to automate and accelerate the discovery of these threats. This work tests ML algorithms in detecting malicious code from dynamic execution characteristics. For this purpose, the CICMalDroid2020 dataset, composed of dynamically obtained Android malware behavior samples, was used with the algorithms XGBoost, Naïve Bayes (NB), Support Vector Classifier (SVC), and Random Forest (RF). The study focused on empirically evaluating the impact of the SMOTE technique, used to mitigate class imbalance in the data, on the performance of these models. The results indicate that, in 75% of the tested configurations, the application of SMOTE led to performance degradation or only marginal improvements, with an average loss of 6.14 percentage points. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%. It is inferred that SMOTE, although widely used, did not prove beneficial for Android malware detection in the CICMalDroid2020 dataset, possibly due to the complexity and sparsity of dynamic characteristics or the nature of malicious relationships. This work highlights the robustness of tree-ensemble models, such as XGBoost, and suggests that algorithmic data balancing approaches may be more effective than generating synthetic instances in certain cybersecurity scenarios.
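For readers unfamiliar with the two ideas the abstract relies on, the following is a minimal, standard-library sketch of the core SMOTE step (interpolating between a minority sample and one of its nearest minority neighbours) and of support-weighted recall. It is an illustration only, not the authors' implementation: the function names `smote_oversample` and `weighted_recall`, the default `k=5`, the `seed` parameter, and the toy data are all assumptions introduced here.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (the core SMOTE idea).
    Hypothetical helper for illustration; not taken from the paper."""
    rng = random.Random(seed)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with each class weighted by its support."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        score += (support / total) * (hits / support)
    return score


# Toy data (assumed): three minority samples in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_oversample(minority, n_synthetic=4, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, SMOTE can place points in regions that real samples never occupy when the features are sparse, which is one of the explanations the abstract offers for the observed degradation. Note also that, for single-label classification, support-weighted recall is numerically equal to overall accuracy.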
Related papers
- Synthetic Data: AI's New Weapon Against Android Malware [0.0]
Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. MalSynGen is a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers.
arXiv Detail & Related papers (2025-11-24T19:27:58Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Imbalanced malware classification: an approach based on dynamic classifier selection [0.0]
A significant challenge in malware detection is the imbalance in datasets, where most applications are benign, with only a small fraction posing a threat. This study addresses the often-overlooked issue of class imbalance in malware detection by evaluating various machine learning strategies for detecting malware in Android applications.
arXiv Detail & Related papers (2025-03-30T19:12:16Z) - CorrNetDroid: Android Malware Detector leveraging a Correlation-based Feature Selection for Network Traffic features [2.9069289358935073]
This work proposes a dynamic analysis-based Android malware detection system, CorrNetDroid, that works over network traffic flows. Many traffic features exhibit overlapping ranges in normal and malware datasets. Our model effectively reduces the feature set while detecting Android malware with 99.50 percent accuracy when considering only two network traffic features.
arXiv Detail & Related papers (2025-03-03T10:52:34Z) - MASKDROID: Robust Android Malware Detection with Masked Graph Representations [56.09270390096083]
We propose MASKDROID, a powerful detector with a strong discriminative ability to identify malware.
We introduce a masking mechanism into the Graph Neural Network based framework, forcing MASKDROID to recover the whole input graph.
This strategy enables the model to understand the malicious semantics and learn more stable representations, enhancing its robustness against adversarial attacks.
arXiv Detail & Related papers (2024-09-29T07:22:47Z) - Prompt Engineering-assisted Malware Dynamic Analysis Using GPT-4 [45.935748395725206]
We introduce a prompt engineering-assisted malware dynamic analysis using GPT-4.
In this method, GPT-4 is employed to create explanatory text for each API call within the API sequence.
BERT is used to obtain the representation of the text, from which we derive the representation of the API sequence.
arXiv Detail & Related papers (2023-12-13T17:39:44Z) - DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection.
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables.
We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z) - OOG- Optuna Optimized GAN Sampling Technique for Tabular Imbalanced Malware Data [0.0]
A Generative Adversarial Network (GAN) sampling technique has been used in this study to generate new malware samples.
In this study, the architecture of the Optuna Optimized GAN (OOG) method is shown, along with scores of 98.06%, 99.0%, 97.23%, and 98.04% for accuracy, precision, recall and f1 score respectively.
arXiv Detail & Related papers (2022-11-25T16:59:30Z) - CARLA-GeAR: a Dataset Generator for a Systematic Evaluation of Adversarial Robustness of Vision Models [61.68061613161187]
This paper presents CARLA-GeAR, a tool for the automatic generation of synthetic datasets for evaluating the robustness of neural models against physical adversarial patches.
The tool is built on the CARLA simulator, using its Python API, and allows the generation of datasets for several vision tasks in the context of autonomous driving.
The paper presents an experimental study to evaluate the performance of some defense methods against such attacks, showing how the datasets generated with CARLA-GeAR might be used in future work as a benchmark for adversarial defense in the real world.
arXiv Detail & Related papers (2022-06-09T09:17:38Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Detection of Malicious Android Applications: Classical Machine Learning vs. Deep Neural Network Integrated with Clustering [2.179313476241343]
Traditional malware detection mechanisms are not able to cope with next-generation malware attacks.
We propose effective and efficient Android malware detection models based on machine learning and deep learning integrated with clustering.
arXiv Detail & Related papers (2021-02-28T21:50:57Z) - Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection [66.05992706105224]
In this paper, an effective anomaly detection framework is proposed utilizing the Bayesian Optimization technique.
The performance of the considered algorithms is evaluated using the ISCX 2012 dataset.
Experimental results show the effectiveness of the proposed framework in terms of accuracy rate, precision, low false-alarm rate, and recall.
arXiv Detail & Related papers (2020-08-05T19:29:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.