On building machine learning pipelines for Android malware detection: a
procedural survey of practices, challenges and opportunities
- URL: http://arxiv.org/abs/2306.07118v1
- Date: Mon, 12 Jun 2023 13:52:28 GMT
- Title: On building machine learning pipelines for Android malware detection: a
procedural survey of practices, challenges and opportunities
- Authors: Masoud Mehrabi Koushki, Ibrahim AbuAlhaol, Anandharaju Durai Raju,
Yang Zhou, Ronnie Salvador Giagone and Huang Shengqiang
- Abstract summary: As the smartphone market leader, Android has been a prominent target for malware attacks.
For market holders and researchers, in particular, the large number of samples has made manual malware detection infeasible.
While some of the proposed approaches achieve high performance, rapidly evolving Android malware has made them unable to maintain their accuracy over time.
- Score: 4.8460847676785175
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: As the smartphone market leader, Android has been a prominent target for
malware attacks. The number of malicious applications (apps) identified for it
has increased continually over the past decade, creating an immense challenge
for all parties involved. For market holders and researchers, in particular,
the large number of samples has made manual malware detection infeasible,
leading to an influx of research that investigates Machine Learning (ML)
approaches to automate this process. However, while some of the proposed
approaches achieve high performance, rapidly evolving Android malware has made
them unable to maintain their accuracy over time. This has created a need in
the community to conduct further research, and build more flexible ML
pipelines. Doing so, however, is currently hindered by the lack of a systematic
overview of the existing literature from which to learn and improve upon
existing solutions. Existing survey papers often focus only on parts of the ML
process (e.g., data collection or model deployment), while omitting other
important stages, such as model evaluation and explanation. In this paper, we
address this problem with a review of 42 highly-cited papers, spanning a decade
of research (from 2011 to 2021). We introduce a novel procedural taxonomy of
the published literature, covering how they have used ML algorithms, what
features they have engineered, which dimensionality reduction techniques they
have employed, what datasets they have employed for training, and what their
evaluation and explanation strategies are. Drawing from this taxonomy, we also
identify gaps in knowledge and provide ideas for improvement and future work.
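The pipeline stages the survey catalogues (feature engineering, dimensionality reduction, model training, evaluation) can be illustrated with a minimal sketch. This is a hypothetical example, not any surveyed paper's method: the binary feature matrix stands in for APK-derived features such as requested permissions or API calls, and the labels are synthetic.

```python
# Minimal sketch of the surveyed pipeline stages:
# feature engineering -> dimensionality reduction -> training -> evaluation.
# Synthetic data stands in for real APK-derived features (hypothetical).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# 500 "apps", 200 binary features (e.g., permissions, API-call indicators)
X = rng.integers(0, 2, size=(500, 200))
y = rng.integers(0, 2, size=500)  # 1 = malware, 0 = benign (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("reduce", SelectKBest(chi2, k=50)),              # dimensionality reduction
    ("clf", RandomForestClassifier(random_state=0)),  # ML algorithm
])
pipe.fit(X_tr, y_tr)
print("F1:", round(f1_score(y_te, pipe.predict(X_te)), 3))
```

Real pipelines differ mainly in the feature extractors (static, dynamic, or hybrid analysis) and the evaluation strategy (e.g., time-aware splits to account for concept drift), which are exactly the axes the taxonomy covers.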
Related papers
- SoK: Dataset Copyright Auditing in Machine Learning Systems [23.00196984807359]
This paper examines current dataset copyright auditing tools, assessing their effectiveness and limitations.
We categorize dataset copyright auditing research into two prominent strands: intrusive methods and non-intrusive methods.
To summarize our results, we offer detailed reference tables, highlight key points, and pinpoint unresolved issues in the current literature.
arXiv Detail & Related papers (2024-10-22T02:06:38Z) - Revisiting Static Feature-Based Android Malware Detection [0.8192907805418583]
This paper highlights critical pitfalls that undermine the validity of machine learning research in Android malware detection.
We propose solutions for improving datasets and methodological practices, enabling fairer model comparisons.
Our paper aims to support future research in Android malware detection and other security domains, enhancing the reliability and validity of published results.
arXiv Detail & Related papers (2024-09-11T16:37:50Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - Investigating Reproducibility in Deep Learning-Based Software Fault
Prediction [16.25827159504845]
With the rapid adoption of increasingly complex machine learning models, it becomes more and more difficult for scholars to reproduce the results that are reported in the literature.
This is in particular the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared.
We have conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles that were published between 2019 and 2022 in top-tier software engineering conferences.
arXiv Detail & Related papers (2024-02-08T13:00:18Z) - Unraveling the Key of Machine Learning Solutions for Android Malware
Detection [33.63795751798441]
This paper presents a comprehensive investigation into machine learning-based Android malware detection.
We first survey the literature, categorizing contributions into a taxonomy based on the Android feature engineering and ML modeling pipeline.
Then, we design a general-purpose framework for ML-based Android malware detection, re-implement 12 representative approaches from different research communities, and evaluate them from three primary dimensions, i.e., effectiveness, robustness, and efficiency.
arXiv Detail & Related papers (2024-02-05T12:31:19Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
In real-world industry settings, Machine Learning is used to build models on user data.
Regulatory mandates to delete user data on request require effort both in terms of data removal and model retraining.
Continuous removal of data and repeated model retraining do not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision
Research [96.53307645791179]
We introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks.
Despite being limited to classification, the resulting stream has a rich diversity of tasks from OCR, to texture analysis, scene recognition, and so forth.
Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks.
arXiv Detail & Related papers (2022-11-15T18:57:46Z) - A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework
of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which explains the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Novel Applications for VAE-based Anomaly Detection Systems [5.065947993017157]
Deep generative modeling (DGM) can create novel and unseen data, starting from a given data set.
As the technology shows promising applications, many ethical issues also arise.
Research indicates different biases affect deep learning models, leading to social issues such as misrepresentation.
arXiv Detail & Related papers (2022-04-26T20:30:37Z) - Inspect, Understand, Overcome: A Survey of Practical Methods for AI
Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.