R+R: Revisiting Static Feature-Based Android Malware Detection using Machine Learning
- URL: http://arxiv.org/abs/2409.07397v2
- Date: Sat, 01 Nov 2025 09:19:46 GMT
- Title: R+R: Revisiting Static Feature-Based Android Malware Detection using Machine Learning
- Authors: Md Tanvirul Alam, Dipkamal Bhusal, Nidhi Rastogi,
- Abstract summary: Static feature-based Android malware detection using machine learning (ML) remains critical due to its scalability and efficiency.<n>Existing approaches often overlook security-critical concerns.<n>We propose a more rigorous methodology for model selection and evaluation.
- Score: 4.014524824655106
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Static feature-based Android malware detection using machine learning (ML) remains critical due to its scalability and efficiency. However, existing approaches often overlook security-critical reproducibility concerns, such as dataset duplication, inadequate hyperparameter tuning, and variance from random initialization. This can significantly compromise the practical effectiveness of these systems. In this paper, we systematically investigate these challenges by proposing a more rigorous methodology for model selection and evaluation. Using two widely used datasets, Drebin and APIGraph, we evaluate six ML models of varying complexity under both offline and continuous active learning settings. Our analysis demonstrates that, contrary to popular belief, well-tuned, simpler models, particularly tree-based methods like XGBoost, consistently outperform more complex neural networks, especially when duplicates are removed. To promote transparency and reproducibility, we open-source our codebase, which is extensible for integrating new models and datasets, facilitating reproducible security research.
Related papers
- Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining.<n>This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning)<n>We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z) - Anomaly Detection and Generation with Diffusion Models: A Survey [51.61574868316922]
Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing.<n>Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest.<n>This survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
arXiv Detail & Related papers (2025-06-11T03:29:18Z) - Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback [59.768119380109084]
This paper introduces an interactive continual learning paradigm where AI models dynamically learn new skills from real-time human feedback.<n>We propose RiCL, a Reinforced interactive Continual Learning framework leveraging Large Language Models (LLMs)<n>Our RiCL approach substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.
arXiv Detail & Related papers (2025-05-15T03:22:03Z) - Large Language Model (LLM) for Software Security: Code Analysis, Malware Analysis, Reverse Engineering [3.1195311942826303]
Large Language Models (LLMs) have emerged as powerful tools in cybersecurity.
LLMs offer advanced capabilities in malware detection, generation, and real-time monitoring.
arXiv Detail & Related papers (2025-04-07T22:32:46Z) - ShuffleGate: An Efficient and Self-Polarizing Feature Selection Method for Large-Scale Deep Models in Industry [12.690406065558394]
ShuffleGate shuffles all feature values across instances simultaneously.<n>It can generate well-separated feature importance scores and estimate the performance without retraining the model.<n>It has been successfully integrated into the daily iteration of Bilibili's search models across various scenarios.
arXiv Detail & Related papers (2025-03-12T12:05:03Z) - Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs)
This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context.
We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - What Really Matters for Learning-based LiDAR-Camera Calibration [50.2608502974106]
This paper revisits the development of learning-based LiDAR-Camera calibration.
We identify the critical limitations of regression-based methods with the widely used data generation pipeline.
We also investigate how the input data format and preprocessing operations impact network performance.
arXiv Detail & Related papers (2025-01-28T14:12:32Z) - Predicting Vulnerability to Malware Using Machine Learning Models: A Study on Microsoft Windows Machines [0.0]
This study addresses the need for effective malware detection strategies by leveraging Machine Learning (ML) techniques.
Our research aims to develop an advanced ML model that accurately predicts malware vulnerabilities based on the specific conditions of individual machines.
arXiv Detail & Related papers (2025-01-05T10:04:58Z) - Underwater Object Detection in the Era of Artificial Intelligence: Current, Challenge, and Future [119.88454942558485]
Underwater object detection (UOD) aims to identify and localise objects in underwater images or videos.
In recent years, artificial intelligence (AI) based methods, especially deep learning methods, have shown promising performance in UOD.
arXiv Detail & Related papers (2024-10-08T00:25:33Z) - Verification of Machine Unlearning is Fragile [48.71651033308842]
We introduce two novel adversarial unlearning processes capable of circumventing both types of verification strategies.
This study highlights the vulnerabilities and limitations in machine unlearning verification, paving the way for further research into the safety of machine unlearning.
arXiv Detail & Related papers (2024-08-01T21:37:10Z) - A Survey of Malware Detection Using Deep Learning [6.349503549199403]
This paper investigates advances in malware detection on Windows, iOS, Android, and Linux using deep learning (DL)
We discuss the issues and the challenges in malware detection using DL classifiers.
We examine eight popular DL approaches on various datasets.
arXiv Detail & Related papers (2024-07-27T02:49:55Z) - Learning to Continually Learn with the Bayesian Principle [36.75558255534538]
In this work, we adopt the meta-learning paradigm to combine the strong representational power of neural networks and simple statistical models' robustness to forgetting.
Since the neural networks remain fixed during continual learning, they are protected from catastrophic forgetting.
arXiv Detail & Related papers (2024-05-29T04:53:31Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection [0.5475886285082937]
This study conducts a thorough examination of malware detection using machine learning techniques.
The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively.
arXiv Detail & Related papers (2024-03-04T17:22:43Z) - A Multi-step Loss Function for Robust Learning of the Dynamics in
Model-based Reinforcement Learning [10.940666275830052]
In model-based reinforcement learning, most algorithms rely on simulating trajectories from one-step models of the dynamics learned on data.
We tackle this issue by using a multi-step objective to train one-step models.
We find that this new loss is particularly useful when the data is noisy, which is often the case in real-life environments.
arXiv Detail & Related papers (2024-02-05T16:13:00Z) - Unraveling the Key of Machine Learning Solutions for Android Malware
Detection [33.63795751798441]
This paper presents a comprehensive investigation into machine learning-based Android malware detection.
We first survey the literature, categorizing contributions into a taxonomy based on the Android feature engineering and ML modeling pipeline.
Then, we design a general-propose framework for ML-based Android malware detection, re-implement 12 representative approaches from different research communities, and evaluate them from three primary dimensions, i.e. effectiveness, robustness, and efficiency.
arXiv Detail & Related papers (2024-02-05T12:31:19Z) - Malicious code detection in android: the role of sequence characteristics and disassembling methods [0.0]
We investigate and emphasize the factors that may affect the accuracy values of the models managed by researchers.
Our findings exhibit that the disassembly method and different input representations affect the model results.
arXiv Detail & Related papers (2023-12-02T11:55:05Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - HyperImpute: Generalized Iterative Imputation with Automatic Model
Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework
of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z) - Adversarial Patterns: Building Robust Android Malware Classifiers [0.9208007322096533]
In the field of cybersecurity, machine learning models have made significant improvements in malware detection.
Despite their ability to understand complex patterns from unstructured data, these models are susceptible to adversarial attacks.
This paper provides a comprehensive review of adversarial machine learning in the context of Android malware classifiers.
arXiv Detail & Related papers (2022-03-04T03:47:08Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical representation of tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Optimization-driven Machine Learning for Intelligent Reflecting Surfaces
Assisted Wireless Networks [82.33619654835348]
Intelligent surface (IRS) has been employed to reshape the wireless channels by controlling individual scattering elements' phase shifts.
Due to the large size of scattering elements, the passive beamforming is typically challenged by the high computational complexity.
In this article, we focus on machine learning (ML) approaches for performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z) - Generative Feature Replay with Orthogonal Weight Modification for
Continual Learning [20.8966035274874]
generative replay is a promising strategy which generates and replays pseudo data for previous tasks to alleviate catastrophic forgetting.
We propose to replay penultimate layer feature with a generative model; 2) leverage a self-supervised auxiliary task to further enhance the stability of feature.
Empirical results on several datasets show our method always achieves substantial improvement over powerful OWM.
arXiv Detail & Related papers (2020-05-07T13:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.