Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
- URL: http://arxiv.org/abs/2507.04055v1
- Date: Sat, 05 Jul 2025 14:36:13 GMT
- Title: Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
- Authors: Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu,
- Abstract summary: Family-Specific String (FSS) features could be utilized in a manner similar to Retrieval-Augmented Generation (RAG) to facilitate Malware Family Classification (MFC)<n>We develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
- Score: 17.816068374958043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only an Yes/No. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
Related papers
- MLRan: A Behavioural Dataset for Ransomware Analysis and Detection [0.7706236363202722]
We introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples.<n>The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants.<n>We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan.
arXiv Detail & Related papers (2025-05-24T09:22:53Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z) - Classification and Online Clustering of Zero-Day Malware [4.409836695738518]
This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them.
Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
arXiv Detail & Related papers (2023-05-01T00:00:07Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework
of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels [21.050311121388813]
We have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset.
MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset.
We provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools.
arXiv Detail & Related papers (2021-11-29T23:59:50Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - DAEMON: Dataset-Agnostic Explainable Malware Classification Using
Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.