Related papers: Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model

Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model

URL: http://arxiv.org/abs/2506.17162v1
Date: Fri, 20 Jun 2025 17:08:08 GMT
Title: Analyzing PDFs like Binaries: Adversarially Robust PDF Malware Analysis via Intermediate Representation and Language Model
Authors: Side Liu, Jiang Ming, Guodong Zhou, Xinyi Liu, Jianming Fu, Guojun Peng,
Abstract summary: Malicious PDF files have emerged as a persistent threat and become a popular attack vector in web-based attacks.<n> PDF malwares are often susceptible to adversarial attacks, undermining their reliability.<n>We propose a novel approach for PDF feature extraction and PDF malware detection.
Score: 27.85605747467984
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Malicious PDF files have emerged as a persistent threat and become a popular attack vector in web-based attacks. While machine learning-based PDF malware classifiers have shown promise, these classifiers are often susceptible to adversarial attacks, undermining their reliability. To address this issue, recent studies have aimed to enhance the robustness of PDF classifiers. Despite these efforts, the feature engineering underlying these studies remains outdated. Consequently, even with the application of cutting-edge machine learning techniques, these approaches fail to fundamentally resolve the issue of feature instability. To tackle this, we propose a novel approach for PDF feature extraction and PDF malware detection. We introduce the PDFObj IR (PDF Object Intermediate Representation), an assembly-like language framework for PDF objects, from which we extract semantic features using a pretrained language model. Additionally, we construct an Object Reference Graph to capture structural features, drawing inspiration from program analysis. This dual approach enables us to analyze and detect PDF malware based on both semantic and structural features. Experimental results demonstrate that our proposed classifier achieves strong adversarial robustness while maintaining an exceptionally low false positive rate of only 0.07% on baseline dataset compared to state-of-the-art PDF malware classifiers.

Related papers

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information.<n>Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.<n>We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
MASKDROID: Robust Android Malware Detection with Masked Graph Representations [56.09270390096083]
We propose MASKDROID, a powerful detector with a strong discriminative ability to identify malware. We introduce a masking mechanism into the Graph Neural Network based framework, forcing MASKDROID to recover the whole input graph. This strategy enables the model to understand the malicious semantics and learn more stable representations, enhancing its robustness against adversarial attacks.
arXiv Detail & Related papers (2024-09-29T07:22:47Z)
A Lean Transformer Model for Dynamic Malware Analysis and Detection [0.0]
Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from execution reports. In this paper, we design an emulation-Only model, based on the Transformers architecture, to detect malicious files.
arXiv Detail & Related papers (2024-08-05T08:46:46Z)
A Feature Set of Small Size for the PDF Malware Detection [8.282177703075451]
We propose a small features set that don't require too much domain knowledge of the PDF file. We report the best accuracy of 99.75% when using Random Forest model. Despite its modest size, we obtain comparable results to state-of-the-art that employ a much larger set of features.
arXiv Detail & Related papers (2023-08-09T04:51:28Z)
DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z)
HAPSSA: Holistic Approach to PDF Malware Detection Using Signal and Statistical Analysis [16.224649756613655]
Malicious PDF documents present a serious threat to various security organizations. State-of-the-art approaches use machine learning (ML) to learn features that characterize PDF malware. In this paper, we derive a simple yet effective holistic approach to PDF malware detection.
arXiv Detail & Related papers (2021-11-08T18:32:47Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
PDF-Malware: An Overview on Threats, Detection and Evasion Attacks [0.966840768820136]
The widespread use of PDF has installed a false impression of inherent safety among benign users. In this work, we give an overview on the PDF-malware detection problem.
arXiv Detail & Related papers (2021-07-27T15:15:20Z)
Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. We propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths. Our method is trained to automatically align features of arbitrary attacking strength.
arXiv Detail & Related papers (2021-05-31T17:01:05Z)
Detection of Adversarial Supports in Few-shot Classifiers Using Feature Preserving Autoencoders and Self-Similarity [89.26308254637702]
We propose a detection strategy to highlight adversarial support sets. We make use of feature preserving autoencoder filtering and also the concept of self-similarity of a support set to perform this detection. Our method is attack-agnostic and also the first to explore detection for few-shot classifiers to the best of our knowledge.
arXiv Detail & Related papers (2020-12-09T14:13:41Z)
Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes. We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks. These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)
Explanation-Guided Backdoor Poisoning Attacks Against Malware Classifiers [12.78844634194129]
Training pipelines for machine learning based malware classification often rely on crowdsourced threat feeds. This paper focuses on challenging "clean label" attacks where attackers do not control the sample labeling process. We propose the use of techniques from explainable machine learning to guide the selection of relevant features and values to create effective backdoor triggers.
arXiv Detail & Related papers (2020-03-02T17:04:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.