Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
- URL: http://arxiv.org/abs/2507.04055v2
- Date: Sun, 26 Oct 2025 15:01:13 GMT
- Title: Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
- Authors: Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu, Jiachi Chen, Rocky K. C. Chang
- Abstract summary: Family-Specific String (FSS) features can be utilized in a manner similar to Retrieval-Augmented Generation (RAG) to facilitate family classification. We develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices.
- Score: 41.02368814412595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malware family classification aims to identify the specific family (e.g., GuLoader or BitRAT) a malware sample may belong to, in contrast to malware detection or sample classification, which only predicts a Yes/No outcome. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for family classification in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features can be utilized in a manner similar to RAG to facilitate family classification. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules, with each providing a relative improvement ranging from 8.1% to 120%.
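The idea of family-specific strings can be illustrated with a minimal sketch. It is not the paper's pipeline (which involves LLMs and a RAG-like retrieval step); it assumes a simplified stand-in in which a string counts as family-specific when it occurs in exactly one family, and a sample is assigned to the family whose strings it overlaps most. The function names, the minimum string length of 4, and the marker strings in the toy data are all hypothetical.

```python
from __future__ import annotations

import re
from collections import defaultdict

MIN_LEN = 4  # assumed minimum length, matching the default of the Unix `strings` tool


def extract_strings(data: bytes, min_len: int = MIN_LEN) -> set[str]:
    """Return the printable-ASCII runs of at least min_len bytes found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {m.decode("ascii") for m in re.findall(pattern, data)}


def build_fss_index(samples: dict[str, list[bytes]]) -> dict[str, set[str]]:
    """Keep, for each family, only the strings that occur in no other family."""
    owners: dict[str, set[str]] = defaultdict(set)
    for family, blobs in samples.items():
        for blob in blobs:
            for s in extract_strings(blob):
                owners[s].add(family)
    index: dict[str, set[str]] = defaultdict(set)
    for s, families in owners.items():
        if len(families) == 1:
            index[next(iter(families))].add(s)
    return dict(index)


def classify(data: bytes, index: dict[str, set[str]]) -> str | None:
    """Return the family with the largest string overlap, or None if there is none."""
    strings = extract_strings(data)
    scores = {family: len(strings & fss) for family, fss in index.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


# Toy data: family names come from the abstract, the byte contents are invented.
train = {
    "GuLoader": [b"\x00\x01gul_marker_abc\x00common_lib\x00"],
    "BitRAT": [b"\x02bitrat_cfg_xyz\x00common_lib\x03"],
}
index = build_fss_index(train)
predicted = classify(b"\xff\xfebitrat_cfg_xyz\x00zz", index)
unknown = classify(b"nothing_here", index)
```

Strings shared across families (here `common_lib`) are dropped from the index, so only the distinctive markers influence the decision; a sample with no matching marker is left unlabeled rather than forced into a family.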
Related papers
- AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z) - RawMal-TF: Raw Malware Dataset Labeled by Type and Family [1.2289361708127875]
This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels. The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline. In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets.
arXiv Detail & Related papers (2025-06-30T14:38:01Z) - MLRan: A Behavioural Dataset for Ransomware Analysis and Detection [0.7706236363202722]
We introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan.
arXiv Detail & Related papers (2025-05-24T09:22:53Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection [34.7994627734601]
We propose a novel hierarchical semi-supervised algorithm, which can be used in the early stages of the malware family labeling process.
With HNMFk, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance.
Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families.
arXiv Detail & Related papers (2023-09-12T23:45:59Z) - Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z) - Classification and Online Clustering of Zero-Day Malware [4.409836695738518]
This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them.
Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families.
arXiv Detail & Related papers (2023-05-01T00:00:07Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels [21.050311121388813]
We have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset.
MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset.
We provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools.
arXiv Detail & Related papers (2021-11-29T23:59:50Z) - Cluster Analysis of Malware Family Relationships [4.111899441919165]
We consider a dataset comprising 20 malware families with 1000 samples per family.
We perform clustering based on pairs of families and use the results to determine relationships between families.
Our results indicate that $K$-means clustering can be a powerful tool for data exploration of malware family relationships.
arXiv Detail & Related papers (2021-03-07T14:51:01Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - DAEMON: Dataset-Agnostic Explainable Malware Classification Using Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z)