On the Abuse and Detection of Polyglot Files
- URL: http://arxiv.org/abs/2407.01529v1
- Date: Mon, 1 Jul 2024 17:59:54 GMT
- Title: On the Abuse and Detection of Polyglot Files
- Authors: Luke Koch, Sean Oesch, Amul Chaulagain, Jared Dixon, Matthew Dixon, Mike Huettal, Amir Sadovnik, Cory Watson, Brian Weber, Jacob Hartman, Richard Patulski
- Abstract summary: Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures.
Existing file-format and embedded-file detection tools fail to reliably detect polyglot files used in the wild.
- Score: 3.6022558854356603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding $30$ polyglot samples and $15$ attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of $0.999$ with an F1 score of $99.20$% for polyglot detection and $99.47$% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized $100$% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
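To make the abstract's definition concrete, here is a minimal sketch of a naive signature check that flags a file whose bytes match more than one format. It is not PolyConv or any tool evaluated in the paper, and the abstract itself reports that existing format-identification tools are unreliable on real polyglots. The names `MAGIC_PREFIXES`, `candidate_formats`, and `looks_like_polyglot` are hypothetical, and the signature table is a small illustrative subset.

```python
import io
import zipfile

# Illustrative subset of leading magic numbers (assumption: real detectors
# track many more formats and variants than these four).
MAGIC_PREFIXES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "pdf": b"%PDF-",
}


def candidate_formats(data: bytes) -> set[str]:
    """Return the names of every format whose signature the bytes match."""
    found = {name for name, magic in MAGIC_PREFIXES.items()
             if data.startswith(magic)}
    # ZIP readers locate the archive through a record near the END of the
    # file, so a ZIP can be appended to another format without breaking it.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("zip")
    return found


def looks_like_polyglot(data: bytes) -> bool:
    """Naive heuristic: two or more matching formats suggests a polyglot."""
    return len(candidate_formats(data)) >= 2
```

A check like this only covers the simplest case (one format at the start of the file, a ZIP at the end). It is meant to illustrate why routing a file by a single detected format can miss a second, embedded one.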
Related papers
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [54.27040631527217]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.
FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score.
FoC-Sim outperforms the previous best methods with a 52% higher Recall@1.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- GlotScript: A Resource and Tool for Low Resource Writing System Identification [53.56700754408902]
GlotScript is an open resource for low resource writing system identification.
GlotScript-R provides attested writing systems for more than 7,000 languages.
GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts.
arXiv Detail & Related papers (2023-09-23T09:35:55Z)
- Robust Multi-bit Natural Language Watermarking through Invariant Features [28.4935678626116]
Original natural language contents are susceptible to illegal piracy and potential misuse.
To effectively combat piracy and protect copyrights, a multi-bit watermarking framework should be able to embed adequate bits of information.
In this work, we explore ways to advance both payload and robustness by following a well-known proposition from image watermarking.
arXiv Detail & Related papers (2023-05-03T05:37:30Z)
- Toward the Detection of Polyglot Files [2.7402733069180996]
It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats.
The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis.
This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
arXiv Detail & Related papers (2022-03-14T23:48:22Z)
- Automatic Polyp Segmentation via Multi-scale Subtraction Network [100.94922587360871]
In clinical practice, precise polyp segmentation provides important information in the early detection of colorectal cancer.
Most existing methods are based on U-shape structure and use element-wise addition or concatenation to fuse different level features progressively in decoder.
We propose a multi-scale subtraction network (MSNet) to segment polyp from colonoscopy image.
arXiv Detail & Related papers (2021-08-11T07:54:07Z)
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
- Content-Based Textual File Type Detection at Scale [0.0]
Programming language detection is a common need in the analysis of large source code bases.
We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content.
arXiv Detail & Related papers (2021-01-21T09:08:42Z)
- Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection [5.876081415416375]
There is a lack of scientific testing of commercially available malware detectors.
We present a scientific evaluation of four market-leading malware detection tools.
Our results show that all four tools have near-perfect precision but alarmingly low recall.
arXiv Detail & Related papers (2020-12-16T19:10:00Z)
- Discovering Bilingual Lexicons in Polyglot Word Embeddings [32.53342453685406]
In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings.
We present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique can retrieve bilingual lexicons.
Across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words.
arXiv Detail & Related papers (2020-08-31T03:57:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.