LLM-Generated Samples for Android Malware Detection
- URL: http://arxiv.org/abs/2510.02391v1
- Date: Tue, 30 Sep 2025 23:46:57 GMT
- Title: LLM-Generated Samples for Android Malware Detection
- Authors: Nik Rollinson, Nikolaos Polatidis
- Abstract summary: We fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS. We evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near-perfect detection, while augmentation with synthetic data preserves high performance with only minor degradations.
- Score: 0.6187780920448871
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Android malware continues to evolve through obfuscation and polymorphism, posing challenges for both signature-based defenses and machine learning models trained on limited and imbalanced datasets. Synthetic data has been proposed as a remedy for scarcity, yet the role of large language models (LLMs) in generating effective malware data for detection tasks remains underexplored. In this study, we fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS, using the KronoDroid dataset. After addressing generation inconsistencies with prompt engineering and post-processing, we evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near-perfect detection, while augmentation with synthetic data preserves high performance with only minor degradations. In contrast, synthetic-only training produces mixed outcomes, with effectiveness varying across malware families and fine-tuning strategies. These findings suggest that LLM-generated malware can enhance scarce datasets without compromising detection accuracy, but remains insufficient as a standalone training source.
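The three-setting protocol in the abstract (real-only, real-plus-synthetic, synthetic-only) can be sketched as below. The toy feature generator, the logistic-regression classifier, and the F1 metric are illustrative stand-ins, not the authors' actual KronoDroid features or model suite:

```python
# Sketch of the three evaluation settings described in the abstract.
# All data here is synthetic toy data, not KronoDroid records.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_records(n, shift):
    """Toy stand-in for structured malware feature vectors."""
    X = rng.normal(shift, 1.0, size=(n, 8))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

X_real, y_real = make_records(600, 1.0)
# "LLM-generated" records: same schema, slightly off-distribution.
X_syn, y_syn = make_records(600, 1.1)

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

settings = {
    "real_only": (X_tr, y_tr),
    "real_plus_synthetic": (np.vstack([X_tr, X_syn]),
                            np.concatenate([y_tr, y_syn])),
    "synthetic_only": (X_syn, y_syn),
}

scores = {}
for name, (X, y) in settings.items():
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Always evaluate on a held-out split of the *real* data.
    scores[name] = f1_score(y_te, clf.predict(X_te))

for name, s in scores.items():
    print(f"{name}: F1={s:.3f}")
```

The key design point mirrored here is that all three settings are scored against the same held-out real test split, so the comparison isolates the effect of the training mixture.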
Related papers
- Synthetic Data: AI's New Weapon Against Android Malware [0.0]
Attackers are now using artificial intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. MalSynGen is a malware synthetic-data-generation methodology that uses a conditional generative adversarial network (cGAN) to generate synthetic data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers.
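The core idea, synthetic tabular samples whose per-class statistics track the real data, can be illustrated without a full cGAN. The per-feature Gaussian sampler below is a deliberately simple stand-in for the learned generator, purely to show the augmentation step; it is not MalSynGen's method:

```python
# Stand-in for a learned conditional generator: fit per-feature
# statistics of one malware class and sample synthetic rows from them.
import numpy as np

rng = np.random.default_rng(42)

def fit_and_sample(X, n):
    """Fit per-feature mean/std and draw n synthetic rows."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    return rng.normal(mu, sigma, size=(n, X.shape[1]))

# Toy "real" feature matrix for one class (e.g. permission counts).
X_malware = rng.normal([3.0, 1.0, 0.5], 0.4, size=(200, 3))
X_syn = fit_and_sample(X_malware, 500)

# Synthetic marginals should track the real ones closely.
print(np.abs(X_syn.mean(axis=0) - X_malware.mean(axis=0)))
```

A cGAN replaces the Gaussian with a learned, class-conditioned generator, which also captures feature correlations that this marginal-only sketch ignores.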
arXiv Detail & Related papers (2025-11-24T19:27:58Z) - Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios [65.97836905826145]
Scarcity of data in scenarios such as medicine, industry, and autonomous driving leads to model overfitting and dataset imbalance. We propose Crucial-Diff, a domain-agnostic framework designed to synthesize crucial samples. Our framework generates diverse, high-quality training data, achieving a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec.
arXiv Detail & Related papers (2025-07-14T04:41:38Z) - The Comparability of Model Fusion to Measured Data in Confuser Rejection [0.24578723416255746]
No dataset can account for every slight deviation we might see in live usage. Simulators using the shooting-and-bouncing-ray method have been developed to generate synthetic SAR data from 3D models. We aim to use computational power as a substitute for this lack of quality measured data by ensembling many models trained on synthetic data.
arXiv Detail & Related papers (2025-05-01T19:51:30Z) - The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text [23.412546862849396]
We assume an adversary has access to some synthetic data generated by a large language model (LLM). We design membership inference attacks (MIAs) that target the training data used to fine-tune the LLM that is then used to synthesize data. We find that canaries crafted for model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released.
arXiv Detail & Related papers (2025-02-19T15:30:30Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [65.23593936798662]
We show that fine-tuning with LLM-generated data improves target-task performance and reduces non-target-task degradation. This is the first work to provide an empirical explanation, based on token-perplexity reduction, for mitigating catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Evaluating Language Models as Synthetic Data Generators [99.16334775127875]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models.
LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset.
This enables the effective identification of models trained on leaked data through model querying alone.
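The detection-by-querying idea can be sketched as below: a tight synthetic "canary" cluster is injected with labels that flip the local class, so a model trained on the leaked set answers probes in that region differently from an innocently trained model. The 2-D toy data, kNN classifier, and probe point are illustrative assumptions, not LDSS's actual construction:

```python
# Sketch of canary injection with a locally shifted class distribution,
# then leak detection via a single query. Illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Owner's real data: class is 1 iff x0 > 0.5.
X = rng.uniform(0, 1, size=(400, 2))
y = (X[:, 0] > 0.5).astype(int)

# Canary: a tight cluster near (0.2, 0.2) labeled 1, flipping the
# class distribution locally.
X_canary = rng.normal([0.2, 0.2], 0.02, size=(20, 2))
y_canary = np.ones(20, dtype=int)

# A model trained on the leaked (canary-injected) dataset vs. an
# innocent model trained on clean data with the same true rule.
leaked = KNeighborsClassifier(5).fit(
    np.vstack([X, X_canary]), np.concatenate([y, y_canary]))
clean = KNeighborsClassifier(5).fit(X, y)

probe = np.array([[0.2, 0.2]])
print("leaked model:", leaked.predict(probe)[0])  # flags the canary region
print("clean model:", clean.predict(probe)[0])    # follows the true rule
```

Only the owner knows where the canary region is, so a suspect model can be audited with query access alone.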
arXiv Detail & Related papers (2023-10-06T10:36:28Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.