Toward the Detection of Polyglot Files
- URL: http://arxiv.org/abs/2203.07561v2
- Date: Wed, 16 Mar 2022 19:29:39 GMT
- Title: Toward the Detection of Polyglot Files
- Authors: Luke Koch, Sean Oesch, Mary Adkisson, Sam Erwin, Brian Weber, Amul Chaulagain
- Abstract summary: It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats.
The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis.
This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
- Score: 2.7402733069180996
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Standardized file formats play a key role in the development and use of
computer software. However, it is possible to abuse standardized file formats
by creating a file that is valid in multiple file formats. The resulting
polyglot (many languages) file can confound file format identification,
allowing elements of the file to evade analysis. This is especially problematic
for malware detection systems that rely on file format identification for
feature extraction. File format identification processes that depend on file
signatures can be easily evaded thanks to flexibility in the format
specifications of certain file formats. Although work has been done to identify
file formats using more comprehensive methods than file signatures, accurate
identification of polyglot files remains an open problem. Since malware
detection systems routinely perform file format-specific feature extraction,
polyglot files need to be filtered out prior to ingestion by these systems.
Otherwise, malicious content could pass through undetected. To address the
problem of polyglot detection we assembled a data set using the mitra tool. We
then evaluated the performance of the most commonly used file identification
tool, file. Finally, we demonstrated the accuracy, precision, recall and F1
score of a range of machine learning and deep learning models. MalConv2 and CatBoost
demonstrated the highest recall on our data set with 95.16% and 95.34%,
respectively. These models can be incorporated into a malware detector's file
processing pipeline to filter out potentially malicious polyglots before file
format-dependent feature extraction takes place.
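A minimal sketch, in Python, of why identification based only on a leading file signature is fragile. It is not the paper's method (the paper evaluates machine learning and deep learning models) and it is not how the file tool works internally. The SIGNATURES table and the function names identify_by_header, find_signatures, and looks_like_polyglot are hypothetical names introduced for illustration, and the table covers only a handful of well-known magic byte sequences.

```python
from typing import Dict, List, Tuple

# Illustrative magic-byte table (an assumption for this sketch, far from complete).
SIGNATURES: Dict[str, bytes] = {
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF89a",
    "jpeg": b"\xff\xd8\xff",
}


def identify_by_header(data: bytes) -> str:
    """Naive identification: only look at the first bytes of the file."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def find_signatures(data: bytes) -> List[Tuple[str, int]]:
    """Return (format, offset) for the first occurrence of each known signature."""
    hits = []
    for name, magic in SIGNATURES.items():
        offset = data.find(magic)
        if offset != -1:
            hits.append((name, offset))
    # Sort by position in the file so the report reads front to back.
    return sorted(hits, key=lambda hit: hit[1])


def looks_like_polyglot(data: bytes) -> bool:
    """Crude heuristic: more than one distinct format signature is present."""
    return len({name for name, _ in find_signatures(data)}) > 1


if __name__ == "__main__":
    # A toy byte string: a GIF header followed by padding and a ZIP local-file header.
    sample = b"GIF89a" + b"\x00" * 16 + b"PK\x03\x04" + b"payload"
    print(identify_by_header(sample))   # gif
    print(find_signatures(sample))      # [('gif', 0), ('zip', 22)]
    print(looks_like_polyglot(sample))  # True
```

The header-only check reports a single format for the toy sample, while the full scan shows a second signature further into the data. The scan itself is only a heuristic: legitimate files that embed other formats (for example, a document containing an image) would also trigger it, which is part of why the abstract treats accurate polyglot identification as an open problem.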
Related papers
- On the Abuse and Detection of Polyglot Files [3.6022558854356603]
Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures.
Existing file-format and embedded-file detection tools fail to reliably detect polyglot files used in the wild.
arXiv Detail & Related papers (2024-07-01T17:59:54Z)
- Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration [82.88166538896331]
We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression.
We test if CLMs understand the JPEG format by probing their capabilities to perform along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files.
Results suggest that CLMs can understand the semantics of compressed data when directly operating on the byte streams of files produced by CFFs.
arXiv Detail & Related papers (2024-05-27T13:09:23Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- Adversarial Networks and Machine Learning for File Classification [0.0]
Correctly identifying the type of file under examination is a critical part of a forensic investigation.
We propose using an adversarially-trained machine learning neural network to determine a file's true type.
Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types.
arXiv Detail & Related papers (2023-01-27T19:40:03Z)
- Watermarking Pre-trained Language Models with Backdooring [118.14981787949199]
We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners.
In addition to using some rare words as triggers, we also show that the combination of common words can be used as backdoor triggers to avoid them being easily detected.
arXiv Detail & Related papers (2022-10-14T05:42:39Z)
- Fourier Document Restoration for Robust Document Dewarping and Recognition [73.44057202891011]
This paper presents FDRNet, a Fourier Document Restoration Network that can restore documents with different distortions.
It dewarps documents by a flexible Thin-Plate Spline transformation which can handle various deformations effectively without requiring deformation annotations in training.
It outperforms the state-of-the-art by large margins on both dewarping and text recognition tasks.
arXiv Detail & Related papers (2022-03-18T12:39:31Z)
- FormatFuzzer: Effective Fuzzing of Binary File Formats [11.201540907330436]
We present FormatFuzzer, a generator for format-specific fuzzers.
The format-specific fuzzer can be used as a standalone producer or mutator in black-box settings.
arXiv Detail & Related papers (2021-09-23T10:28:35Z)
- Efficient video integrity analysis through container characterization [77.45740041478743]
We introduce a container-based method to identify the software used to perform a video manipulation.
The proposed method is both efficient and effective and can also provide a simple explanation for its decisions.
It achieves an accuracy of 97.6% in distinguishing pristine from tampered videos and classifying the editing software.
arXiv Detail & Related papers (2021-01-26T14:13:39Z)
- Content-Based Textual File Type Detection at Scale [0.0]
Programming language detection is a common need in the analysis of large source code bases.
We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
arXiv Detail & Related papers (2021-01-21T09:08:42Z)
- Short Text Classification Approach to Identify Child Sexual Exploitation Material [4.415977307120616]
This paper presents two approaches based on short text classification to identify Child Sexual Exploitation Material (CSEM) files.
The presented solution could be integrated into forensic tools and services to support Law Enforcement Agencies to identify CSEM without tackling every file's visual content.
arXiv Detail & Related papers (2020-10-29T09:37:16Z)
- Detecting malicious PDF using CNN [46.86114958340962]
Malicious PDF files represent one of the biggest threats to computer security.
We propose a novel algorithm that uses an ensemble of Convolutional Neural Network (CNN) on the byte level of the file.
We show, using a data set of 90000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware.
arXiv Detail & Related papers (2020-07-24T18:27:45Z)