Related papers: On the Abuse and Detection of Polyglot Files

URL: http://arxiv.org/abs/2407.01529v1
Date: Mon, 1 Jul 2024 17:59:54 GMT
Title: On the Abuse and Detection of Polyglot Files
Authors: Luke Koch, Sean Oesch, Amul Chaulagain, Jared Dixon, Matthew Dixon, Mike Huettal, Amir Sadovnik, Cory Watson, Brian Weber, Jacob Hartman, Richard Patulski
Abstract summary: Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures. Existing file-format and embedded-file detection tools fail to reliably detect polyglot files used in the wild.
Score: 3.6022558854356603
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild, the first of its kind, we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
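
To make the definition in the abstract concrete, the following is a minimal Python sketch of a naive signature-based check. It is not PolyConv or ImSan, and it is not taken from the paper; the names HEAD_SIGNATURES, candidate_formats, and is_polyglot are hypothetical. It assumes only well-known leading magic bytes for a handful of formats, plus the fact that ZIP readers locate the archive from a record at the end of the file, which is what lets a ZIP archive be appended to an image. A matching signature shows only that a file claims a format, not that it fully parses as one.

```python
import io
import zipfile

# Well-known leading magic bytes for a few formats (illustrative, not exhaustive).
HEAD_SIGNATURES = {
    "gif": (b"GIF87a", b"GIF89a"),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "pdf": (b"%PDF-",),
}


def candidate_formats(data: bytes) -> set[str]:
    """Return the formats whose signatures are present in `data`."""
    found = {
        name for name, sigs in HEAD_SIGNATURES.items() if data.startswith(sigs)
    }
    # ZIP archives are located via a record near the end of the file,
    # so a ZIP can follow arbitrary leading bytes such as an image.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("zip")
    return found


def is_polyglot(data: bytes) -> bool:
    """Heuristic: flag data that matches more than one format signature."""
    return len(candidate_formats(data)) > 1
```

A check like this only covers the formats and layouts it was written for, which is consistent with the abstract's finding that existing format-identification tools do not reliably detect polyglots seen in the wild.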

Related papers

ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
SCORE: Syntactic Code Representations for Static Script Malware Detection [9.502104012686491]
Server-side script attacks can steal data, compromise credentials, and disrupt operations. We propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection. Our approach achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions.
arXiv Detail & Related papers (2024-11-12T20:58:04Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for detecting LLM-generated code. We find that existing training-based or zero-shot text detectors are ineffective at detecting code. Our method is robust against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
GlotScript: A Resource and Tool for Low Resource Writing System Identification [53.56700754408902]
GlotScript is an open resource for low resource writing system identification. GlotScript-R provides attested writing systems for more than 7,000 languages. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts.
arXiv Detail & Related papers (2023-09-23T09:35:55Z)
Robust Multi-bit Natural Language Watermarking through Invariant Features [28.4935678626116]
Original natural language contents are susceptible to illegal piracy and potential misuse. To effectively combat piracy and protect copyrights, a multi-bit watermarking framework should be able to embed adequate bits of information. In this work, we explore ways to advance both payload and robustness by following a well-known proposition from image watermarking.
arXiv Detail & Related papers (2023-05-03T05:37:30Z)
Toward the Detection of Polyglot Files [2.7402733069180996]
It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
arXiv Detail & Related papers (2022-03-14T23:48:22Z)
Automatic Polyp Segmentation via Multi-scale Subtraction Network [100.94922587360871]
In clinical practice, precise polyp segmentation provides important information for the early detection of colorectal cancer. Most existing methods are based on a U-shaped structure and use element-wise addition or concatenation to progressively fuse features from different levels in the decoder. We propose a multi-scale subtraction network (MSNet) to segment polyps from colonoscopy images.
arXiv Detail & Related papers (2021-08-11T07:54:07Z)
D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
Content-Based Textual File Type Detection at Scale [0.0]
Programming language detection is a common need in the analysis of large source code bases. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
arXiv Detail & Related papers (2021-01-21T09:08:42Z)
Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection [5.876081415416375]
There is a lack of scientific testing of commercially available malware detectors. We present a scientific evaluation of four market-leading malware detection tools. Our results show that all four tools have near-perfect precision but alarmingly low recall.
arXiv Detail & Related papers (2020-12-16T19:10:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.