Semantic Preprocessing for LLM-based Malware Analysis
- URL: http://arxiv.org/abs/2506.12113v3
- Date: Thu, 26 Jun 2025 15:09:42 GMT
- Title: Semantic Preprocessing for LLM-based Malware Analysis
- Authors: Benjamin Marais, Tony Quertier, Grégoire Barrue,
- Abstract summary: We propose a new preprocessing method which creates reports for Portable Executable files.<n>The purpose of this preprocessing is to gather a semantic representation of binary files, understandable by malware analysts.<n>Using this preprocessing to train a Large Language Model for Malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a context of malware analysis, numerous approaches rely on Artificial Intelligence to handle a large volume of data. However, these techniques focus on data view (images, sequences) and not on an expert's view. Noticing this issue, we propose a preprocessing that focuses on expert knowledge to improve malware semantic analysis and result interpretability. We propose a new preprocessing method which creates JSON reports for Portable Executable files. These reports gather features from both static and behavioral analysis, and incorporate packer signature detection, MITRE ATT\&CK and Malware Behavior Catalog (MBC) knowledge. The purpose of this preprocessing is to gather a semantic representation of binary files, understandable by malware analysts, and that can enhance AI models' explainability for malicious files analysis. Using this preprocessing to train a Large Language Model for Malware classification, we achieve a weighted-average F1-score of 0.94 on a complex dataset, representative of market reality.
Related papers
- ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models [82.04858317800097]
We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts.<n>ForenX employs the powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues.<n>We introduce ForgReason, a dataset dedicated to descriptions of forgery evidences in AI-generated images.
arXiv Detail & Related papers (2025-08-02T15:21:26Z) - Enhanced Consistency Bi-directional GAN(CBiGAN) for Malware Anomaly Detection [0.25163931116642785]
This paper introduces the application of the CBiGAN in the domain of malware anomaly detection.<n>We utilize several datasets including both portable executable (PE) files as well as Object Linking and Embedding (OLE) files.<n>We then evaluated our model against a diverse set of both PE and OLE files, including self-collected malicious executables from 214 malware families.
arXiv Detail & Related papers (2025-06-09T02:43:25Z) - ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling [39.68433051199151]
Family labeling is an essential component of cyberattack investigation, attribution, and remediation.<n>ClarAVy is a tool to determine the family to which a malicious file belongs.<n>ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets.
arXiv Detail & Related papers (2025-02-04T22:55:39Z) - Exploring Large Language Models for Semantic Analysis and Categorization of Android Malware [0.0]
msp is designed to augment malware analysis for Android through a hierarchical-tiered summarization chain and strategic prompt engineering.<n>msp can achieve up to 77% classification accuracy while providing highly robust summaries at functional, class, and package levels.
arXiv Detail & Related papers (2025-01-08T21:22:45Z) - A Lean Transformer Model for Dynamic Malware Analysis and Detection [0.0]
Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue.
Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from execution reports.
In this paper, we design an emulation-Only model, based on the Transformers architecture, to detect malicious files.
arXiv Detail & Related papers (2024-08-05T08:46:46Z) - Semantic Data Representation for Explainable Windows Malware Detection Models [0.0]
We propose PE Malware Ontology that offers a reusable semantic schema for Portable Executable (PE - the Windows binary format) malware files.
This ontology is inspired by the structure of the EMBER dataset, which focuses on the static malware analysis of PE files.
We also publish semantically treated EMBER data, including fractional datasets, to support experiments on EMBER.
arXiv Detail & Related papers (2024-03-18T11:17:27Z) - Prompt Engineering-assisted Malware Dynamic Analysis Using GPT-4 [45.935748395725206]
We introduce a prompt engineering-assisted malware dynamic analysis using GPT-4.
In this method, GPT-4 is employed to create explanatory text for each API call within the API sequence.
BERT is used to obtain the representation of the text, from which we derive the representation of the API sequence.
arXiv Detail & Related papers (2023-12-13T17:39:44Z) - MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers [44.700094741798445]
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family.
We have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with.
We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total.
arXiv Detail & Related papers (2023-10-18T04:36:26Z) - EMBERSim: A Large-Scale Databank for Boosting Similarity Search in
Malware Analysis [48.5877840394508]
In recent years there has been a shift from quantifications-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z) - DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified
Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection.
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables.
We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z) - Towards an Automated Pipeline for Detecting and Classifying Malware
through Machine Learning [0.0]
We propose a malware taxonomic classification pipeline able to classify Windows Portable Executable files (PEs)
Given an input PE sample, it is first classified as either malicious or benign.
If malicious, the pipeline further analyzes it in order to establish its threat type, family, and behavior(s)
arXiv Detail & Related papers (2021-06-10T10:07:50Z) - Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z) - Being Single Has Benefits. Instance Poisoning to Deceive Malware
Classifiers [47.828297621738265]
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier.
As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger.
We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z) - Adversarial EXEmples: A Survey and Experimental Evaluation of Practical
Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes.
We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks.
These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.