Synthetic Data: AI's New Weapon Against Android Malware
- URL: http://arxiv.org/abs/2511.19649v1
- Date: Mon, 24 Nov 2025 19:27:58 GMT
- Title: Synthetic Data: AI's New Weapon Against Android Malware
- Authors: Angelo Gaspar Diniz Nogueira, Kayua Oleques Paim, Hendrio Bragança, Rodrigo Brandão Mansilha, Diego Kreutz,
- Abstract summary: Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. MalSynGen is a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples present significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low-quality data in malware detection.
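MalSynGen's actual cGAN architecture and training code are not given here; as a rough illustration of the conditional-generation idea the abstract describes, the following numpy sketch shows only the data flow of an (untrained, hypothetical) generator: a noise vector is concatenated with a one-hot class label, so synthetic tabular rows can be requested per malware class. All dimensions and weights below are made up for illustration; in a real cGAN the weights are learned adversarially against a discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 2        # e.g., benign / malware
NOISE_DIM = 8
N_FEATURES = 16      # tabular feature columns (e.g., permissions, API calls)

# Hypothetical generator weights; a real cGAN learns these adversarially.
W = rng.normal(size=(NOISE_DIM + N_CLASSES, N_FEATURES))
b = np.zeros(N_FEATURES)

def one_hot(labels, n_classes):
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def generate(labels):
    """Draw one synthetic tabular row per requested class label."""
    z = rng.normal(size=(len(labels), NOISE_DIM))              # noise input
    cond = np.concatenate([z, one_hot(labels, N_CLASSES)], axis=1)
    return np.tanh(cond @ W + b)                               # rows in [-1, 1]

labels = np.array([0, 1, 1])
synthetic = generate(labels)
print(synthetic.shape)  # (3, 16): three rows, one per requested label
```

The key point is the conditioning: because the class label is part of the generator's input, the trained model can be asked for samples of a specific class, which is what makes cGAN output usable for augmenting a labeled malware dataset.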
Related papers
- Empirical Evaluation of SMOTE in Android Malware Detection with Machine Learning: Challenges and Performance in CICMalDroid 2020 [0.0]
This work tests Machine Learning algorithms in detecting malicious code from dynamic execution characteristics. In 75% of the tested configurations, the application of SMOTE led to performance degradation or only marginal improvements. Tree-based algorithms, such as XGBoost and Random Forest, consistently outperformed the others, achieving weighted recall above 94%.
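The SMOTE oversampling evaluated in that entry can be sketched in a few lines. This is an illustrative numpy implementation of the core interpolation step only, not the reference `imbalanced-learn` implementation: each synthetic minority row is placed on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import numpy as np

def smote(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: synthesize minority-class rows by interpolating
    between each sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest-neighbour indices
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                  # random base sample
        b = nn[a, rng.integers(k)]           # random neighbour of it
        lam = rng.random()                   # interpolation factor in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synth = smote(X_min, n_new=6)
print(synth.shape)  # (6, 2)
```

Because synthetic points are convex combinations of real minority samples, they stay inside the minority region; the paper's negative result suggests this interpolation can still hurt when the minority class is not well-clustered in feature space.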
arXiv Detail & Related papers (2026-02-09T14:47:47Z) - ThreatIntel-Andro: Expert-Verified Benchmarking for Robust Android Malware Research [12.287399657700824]
Real-time Android malware datasets are a critical foundation for effective detection and defense. Traditional datasets, such as VirusTotal's multi-engine aggregation results, exhibit significant limitations. Automated labeling tools (e.g., AVClass2) suffer from suboptimal aggregation strategies.
arXiv Detail & Related papers (2025-10-19T13:51:27Z) - LLM-Generated Samples for Android Malware Detection [0.6187780920448871]
We fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS. We evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near-perfect detection, while augmentation with synthetic data preserves high performance with only minor degradation.
arXiv Detail & Related papers (2025-09-30T23:46:57Z) - PuckTrick: A Library for Making Synthetic Data More Realistic [46.198289193451146]
We introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. We evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data.
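Controlled contamination of the kind Pucktrick performs can be sketched simply. The function below is an illustrative stand-in, not the library's actual API: it flips a fixed fraction of binary labels and blanks out a fixed fraction of feature cells, so the error rates injected into a synthetic dataset are known exactly.

```python
import numpy as np

def contaminate(X, y, label_flip=0.1, missing=0.05, seed=0):
    """Illustrative controlled contamination (not Pucktrick's API):
    flip a fraction of binary labels and blank out a fraction of
    feature cells with NaN, returning the counts actually injected."""
    rng = np.random.default_rng(seed)
    X, y = X.copy().astype(float), y.copy()
    flip = rng.random(len(y)) < label_flip
    y[flip] = 1 - y[flip]                    # binary label flip
    mask = rng.random(X.shape) < missing
    X[mask] = np.nan                         # injected missing values
    return X, y, int(flip.sum()), int(mask.sum())

X = np.ones((200, 10))
y = np.zeros(200, dtype=int)
Xc, yc, n_flipped, n_missing = contaminate(X, y)
print(n_flipped, n_missing)
```

Returning the injected counts makes the contamination reproducible and auditable, which is what lets an experiment attribute performance changes to a specific, known error rate.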
arXiv Detail & Related papers (2025-06-23T10:51:45Z) - R+R: Revisiting Static Feature-Based Android Malware Detection using Machine Learning [4.014524824655106]
Static feature-based Android malware detection using machine learning (ML) remains critical due to its scalability and efficiency. Existing approaches often overlook security-critical concerns. We propose a more rigorous methodology for model selection and evaluation.
arXiv Detail & Related papers (2024-09-11T16:37:50Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
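The chronological splitting strategy mentioned for PrimeVul can be sketched as follows. This is a generic illustration (field names are hypothetical), not PrimeVul's pipeline: the test set may only contain samples strictly newer than the cutoff, so no future information leaks into training.

```python
from datetime import date

def chronological_split(samples, cutoff):
    """Split dated samples so the test set contains only items newer than
    everything in training (avoids temporal data leakage)."""
    train = [s for s in samples if s["date"] <= cutoff]
    test = [s for s in samples if s["date"] > cutoff]
    return train, test

# Hypothetical dated samples (e.g., vulnerability-fix commits).
samples = [
    {"id": "f1", "date": date(2021, 3, 1)},
    {"id": "f2", "date": date(2022, 7, 9)},
    {"id": "f3", "date": date(2023, 1, 15)},
]
train, test = chronological_split(samples, cutoff=date(2022, 12, 31))
print([s["id"] for s in train], [s["id"] for s in test])  # ['f1', 'f2'] ['f3']
```

A random split would let a model "see" code patterns from after the cutoff during training, inflating benchmark scores; the chronological split is what makes the evaluation resemble deployment.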
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - Android Malware Detection with Unbiased Confidence Guarantees [1.6432632226868131]
We propose a machine learning dynamic analysis approach that provides provably valid confidence guarantees in each malware detection.
The proposed approach is based on a novel machine learning framework, called Conformal Prediction, combined with a random forests classifier.
We examine its performance on a large-scale dataset collected by installing 1866 malicious and 4816 benign applications on a real Android device.
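The confidence guarantee in Conformal Prediction rests on a simple p-value computation, sketched below with hypothetical calibration scores (this is the generic split-conformal formula, not the paper's exact setup): the p-value for a candidate label is the smoothed fraction of calibration nonconformity scores at least as extreme as the new sample's.

```python
import numpy as np

def conformal_p_value(cal_scores, new_score):
    """Split-conformal p-value: fraction of calibration nonconformity
    scores >= the new sample's score, with +1 smoothing. Rejecting a
    label when p < epsilon gives a provable error rate of at most epsilon."""
    cal_scores = np.asarray(cal_scores)
    return (1 + np.sum(cal_scores >= new_score)) / (len(cal_scores) + 1)

# Hypothetical nonconformity scores, e.g. 1 - random-forest vote share
# for the tested label on a held-out calibration set.
calibration = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.40, 0.35, 0.22, 0.12]
p = conformal_p_value(calibration, new_score=0.38)
print(round(p, 3))  # 0.182
```

The guarantee is distribution-free: under exchangeability of calibration and test data, the p-value is valid regardless of the underlying classifier, which is why it can be wrapped around a random forest as the paper does.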
arXiv Detail & Related papers (2023-12-17T11:07:31Z) - A Discrepancy Aware Framework for Robust Anomaly Detection [51.710249807397695]
We present a Discrepancy Aware Framework (DAF), which demonstrates robust performance consistently with simple and cheap strategies.
Our method leverages an appearance-agnostic cue to guide the decoder in identifying defects, thereby alleviating its reliance on synthetic appearance.
Under simple synthesis strategies, it outperforms existing methods by a large margin and also achieves state-of-the-art localization performance.
arXiv Detail & Related papers (2023-10-11T15:21:40Z) - EMBERSim: A Large-Scale Databank for Boosting Similarity Search in
Malware Analysis [48.5877840394508]
In recent years there has been a shift from quantifications-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework
of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks.
It captures discrete variables in the data alongside the weak supervision derived label estimate.
It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z) - New Datasets for Dynamic Malware Classification [0.0]
We introduce two new, updated datasets of malicious software, VirusSamples and VirusShare.
This paper analyzes multi-class malware classification performance of the balanced and imbalanced version of these two datasets.
Results show that Support Vector Machine achieves the highest score of 94% on the imbalanced VirusSample dataset.
XGBoost, one of the most common gradient-boosting-based models, achieves the highest scores of 90% and 80% on the two versions of the VirusShare dataset.
arXiv Detail & Related papers (2021-11-30T08:31:16Z) - Being Single Has Benefits. Instance Poisoning to Deceive Malware Classifiers [47.828297621738265]
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier.
As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger.
We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z)
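The instance-poisoning idea in that last entry, i.e. targeting specific instances via an implanted trigger rather than whole malware families, can be sketched generically. The function below is an illustrative toy, not the paper's attack: it stamps a fixed feature pattern onto chosen training rows and mislabels them, so a classifier trained on the poisoned set learns to associate the trigger with the benign class.

```python
import numpy as np

def implant_trigger(X, y, target_idx, trigger_cols, trigger_val=1.0):
    """Toy instance-poisoning sketch: stamp a fixed trigger pattern onto
    chosen training rows and flip their labels to benign (0), so only
    instances carrying the trigger are later misclassified."""
    Xp, yp = X.copy(), y.copy()
    for i in target_idx:
        Xp[i, trigger_cols] = trigger_val    # implant the trigger pattern
        yp[i] = 0                            # mislabel as benign
    return Xp, yp

rng = np.random.default_rng(1)
X = rng.random((100, 20))
y = np.ones(100, dtype=int)                  # all rows labeled malware
Xp, yp = implant_trigger(X, y, target_idx=[3, 7], trigger_cols=[0, 1, 2])
print(int((yp == 0).sum()))  # 2 poisoned rows
```

Because the trigger only occupies a few feature columns on a few rows, the poisoned set is statistically close to the clean one, which is what makes such attacks hard to spot and motivates the detection approach the paper proposes.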
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.