Related papers: Toward the Detection of Polyglot Files

URL: http://arxiv.org/abs/2203.07561v2
Date: Wed, 16 Mar 2022 19:29:39 GMT
Title: Toward the Detection of Polyglot Files
Authors: Luke Koch, Sean Oesch, Mary Adkisson, Sam Erwin, Brian Weber, Amul Chaulagain
Abstract summary: It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
Score: 2.7402733069180996
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Standardized file formats play a key role in the development and use of computer software. However, it is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction. File format identification processes that depend on file signatures can be easily evaded thanks to flexibility in the format specifications of certain file formats. Although work has been done to identify file formats using more comprehensive methods than file signatures, accurate identification of polyglot files remains an open problem. Since malware detection systems routinely perform file format-specific feature extraction, polyglot files need to be filtered out prior to ingestion by these systems. Otherwise, malicious content could pass through undetected. To address the problem of polyglot detection we assembled a data set using the mitra tool. We then evaluated the performance of the most commonly used file identification tool, file. Finally, we demonstrated the accuracy, precision, recall and F1 score of a range of machine and deep learning models. Malconv2 and Catboost demonstrated the highest recall on our data set with 95.16% and 95.34%, respectively. These models can be incorporated into a malware detector's file processing pipeline to filter out potentially malicious polyglots before file format-dependent feature extraction takes place.
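
To illustrate why signature-based identification is easy to confuse, here is a minimal Python sketch of a naive scan that looks for several well-known magic numbers anywhere in a byte string. It is not the method evaluated in the paper (which relies on machine and deep learning models); the signature table and the function names `find_signatures` and `looks_like_polyglot` are assumptions made for illustration only, and short signatures would produce false positives on real data.

```python
from typing import Dict

# Hypothetical, deliberately tiny signature table (illustration only).
SIGNATURES = {
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF89a",
}


def find_signatures(data: bytes) -> Dict[str, int]:
    """Map each format whose signature occurs in data to its first offset."""
    hits = {}
    for name, magic in SIGNATURES.items():
        offset = data.find(magic)
        if offset != -1:
            hits[name] = offset
    return hits


def looks_like_polyglot(data: bytes) -> bool:
    """Flag data that contains signatures of more than one format."""
    return len(find_signatures(data)) > 1


# A toy byte string that starts like a PDF but also embeds a ZIP header.
sample = b"%PDF-1.7\n" + b"filler" + b"PK\x03\x04" + b"payload"
print(find_signatures(sample))
print(looks_like_polyglot(sample))
```

A check that only inspects the leading bytes would label `sample` as a PDF and never see the embedded ZIP header, which is the kind of blind spot the abstract describes. Scanning the whole file catches this toy case but remains a heuristic, not a reliable detector.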

Related papers

On the Abuse and Detection of Polyglot Files [3.6022558854356603]
Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures. Existing file-format and embedded-file detection tools fail to reliably detect polyglot files used in the wild.
arXiv Detail & Related papers (2024-07-01T17:59:54Z)
Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration [82.88166538896331]
We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression. We test if CLMs understand the JPEG format by probing their capabilities to perform along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Results suggest that CLMs can understand the semantics of compressed data when directly operating on the byte streams of files produced by CFFs.
arXiv Detail & Related papers (2024-05-27T13:09:23Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
Adversarial Networks and Machine Learning for File Classification [0.0]
Correctly identifying the type of file under examination is a critical part of a forensic investigation. We propose using an adversarially-trained machine learning neural network to determine a file's true type. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types.
arXiv Detail & Related papers (2023-01-27T19:40:03Z)
Watermarking Pre-trained Language Models with Backdooring [118.14981787949199]
We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners. In addition to using some rare words as triggers, we also show that the combination of common words can be used as backdoor triggers to avoid them being easily detected.
arXiv Detail & Related papers (2022-10-14T05:42:39Z)
Fourier Document Restoration for Robust Document Dewarping and Recognition [73.44057202891011]
This paper presents FDRNet, a Fourier Document Restoration Network that can restore documents with different distortions. It dewarps documents by a flexible Thin-Plate Spline transformation which can handle various deformations effectively without requiring deformation annotations in training. It outperforms the state-of-the-art by large margins on both dewarping and text recognition tasks.
arXiv Detail & Related papers (2022-03-18T12:39:31Z)
FormatFuzzer: Effective Fuzzing of Binary File Formats [11.201540907330436]
We present FormatFuzzer, a generator for format-specific fuzzers. The format-specific fuzzer can be used as a standalone producer or mutator in black-box settings.
arXiv Detail & Related papers (2021-09-23T10:28:35Z)
Efficient video integrity analysis through container characterization [77.45740041478743]
We introduce a container-based method to identify the software used to perform a video manipulation. The proposed method is both efficient and effective and can also provide a simple explanation for its decisions. It achieves an accuracy of 97.6% in distinguishing pristine from tampered videos and classifying the editing software.
arXiv Detail & Related papers (2021-01-26T14:13:39Z)
Content-Based Textual File Type Detection at Scale [0.0]
Programming language detection is a common need in the analysis of large source code bases. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
arXiv Detail & Related papers (2021-01-21T09:08:42Z)
Short Text Classification Approach to Identify Child Sexual Exploitation Material [4.415977307120616]
This paper presents two approaches based on short text classification to identify Child Sexual Exploitation Material (CSEM) files. The presented solution could be integrated into forensic tools and services to support Law Enforcement Agencies to identify CSEM without tackling every file's visual content.
arXiv Detail & Related papers (2020-10-29T09:37:16Z)
Detecting malicious PDF using CNN [46.86114958340962]
Malicious PDF files represent one of the biggest threats to computer security. We propose a novel algorithm that uses an ensemble of Convolutional Neural Network (CNN) on the byte level of the file. We show, using a data set of 90000 files downloadable online, that our approach maintains a high detection rate (94%) of PDF malware.
arXiv Detail & Related papers (2020-07-24T18:27:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.