A Feature Set of Small Size for the PDF Malware Detection
- URL: http://arxiv.org/abs/2308.04704v2
- Date: Thu, 10 Aug 2023 03:08:03 GMT
- Title: A Feature Set of Small Size for the PDF Malware Detection
- Authors: Ran Liu and Charles Nicholas
- Abstract summary: We propose a small feature set that doesn't require much domain knowledge of the PDF file.
We report a best accuracy of 99.75% when using a Random Forest model.
Despite its modest size, we obtain results comparable to state-of-the-art methods that employ a much larger set of features.
- Score: 8.282177703075451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML)-based malware detection systems are becoming
increasingly important as malware threats increase and get more sophisticated.
PDF files are often used as vectors for phishing attacks because they are
widely regarded as trustworthy data resources, and are accessible across
different platforms. Therefore, researchers have developed many different PDF
malware detection methods. Performance in detecting PDF malware is greatly
influenced by feature selection. In this research, we propose a small feature
set that doesn't require much domain knowledge of the PDF file. We evaluate
the proposed features with six different machine learning models. We report a
best accuracy of 99.75% when using a Random Forest model. Our proposed feature
set, which consists of just 12 features, is one of the most concise in the
field of PDF malware detection. Despite its modest size, we obtain results
comparable to state-of-the-art methods that employ a much larger set of features.
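As an illustration of what low-domain-knowledge features might look like, here is a minimal sketch that computes a handful of cheap counts directly from raw PDF bytes. The abstract does not enumerate the paper's 12 features, so the keyword list, the feature names, and the `extract_features` helper below are all hypothetical stand-ins, not the authors' feature set.

```python
import re

# Hypothetical keyword list (an assumption for illustration only);
# the paper's actual 12 features are not listed in the abstract.
KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch", b"/EmbeddedFile"]


def extract_features(data: bytes) -> dict:
    """Compute a few simple features from raw PDF bytes without parsing the file."""
    features = {
        "size_bytes": len(data),
        "has_pdf_header": int(data.startswith(b"%PDF-")),
        # \b keeps "endobj" and "endstream" from being counted.
        "obj_count": len(re.findall(rb"\bobj\b", data)),
        "stream_count": len(re.findall(rb"\bstream\b", data)),
    }
    for kw in KEYWORDS:
        # Feature name is the keyword without its leading slash.
        features["count_" + kw.decode().lstrip("/")] = data.count(kw)
    return features


# Tiny hand-written sample, not a real-world file.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
feats = extract_features(sample)
for name, value in feats.items():
    print(f"{name}: {value}")
```

A fixed-length vector built from such a dictionary could then be passed to any standard classifier, for example a random forest; which features and hyperparameters the authors actually used is not stated here.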
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task to process and comprehend large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- Detecting new obfuscated malware variants: A lightweight and interpretable machine learning approach [0.0]
We present a machine learning-based system for detecting obfuscated malware that is highly accurate, lightweight and interpretable.
Our system is capable of detecting 15 malware subtypes despite being exclusively trained on one malware subtype, namely the Transponder from the Spyware family.
The Transponder-focused model exhibited high accuracy, exceeding 99.8%, with an average processing speed of 5.7 microseconds per file.
arXiv Detail & Related papers (2024-07-07T12:41:40Z)
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restricted to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection.
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables.
We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z)
- Investigating Feature and Model Importance in Android Malware Detection: An Implemented Survey and Experimental Comparison of ML-Based Methods [2.9248916859490173]
We show that high detection accuracies can be achieved using features extracted through static analysis alone.
Random forests are generally the most effective model, outperforming more complex deep learning approaches.
arXiv Detail & Related papers (2023-01-30T10:48:10Z)
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors [49.34155921877441]
We propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors.
We then use this framework to test several state-of-the-art malware detectors' abilities to detect manipulated malware.
Our findings shed light on the limitations of current malware detectors, as well as how they can be improved.
arXiv Detail & Related papers (2021-11-19T08:02:38Z)
- HAPSSA: Holistic Approach to PDF Malware Detection Using Signal and Statistical Analysis [16.224649756613655]
Malicious PDF documents present a serious threat to various security organizations.
State-of-the-art approaches use machine learning (ML) to learn features that characterize PDF malware.
In this paper, we derive a simple yet effective holistic approach to PDF malware detection.
arXiv Detail & Related papers (2021-11-08T18:32:47Z)
- PDF-Malware: An Overview on Threats, Detection and Evasion Attacks [0.966840768820136]
The widespread use of PDF has instilled a false impression of inherent safety among benign users.
In this work, we give an overview on the PDF-malware detection problem.
arXiv Detail & Related papers (2021-07-27T15:15:20Z)
- Detecting malicious PDF using CNN [46.86114958340962]
Malicious PDF files represent one of the biggest threats to computer security.
We propose a novel algorithm that uses an ensemble of Convolutional Neural Network (CNN) on the byte level of the file.
We show, using a data set of 90000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware.
arXiv Detail & Related papers (2020-07-24T18:27:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.