Related papers: FormatFuzzer: Effective Fuzzing of Binary File Formats

URL: http://arxiv.org/abs/2109.11277v3
Date: Wed, 27 Sep 2023 12:57:33 GMT
Title: FormatFuzzer: Effective Fuzzing of Binary File Formats
Authors: Rafael Dutra, Rahul Gopinath, Andreas Zeller
Abstract summary: We present FormatFuzzer, a generator for format-specific fuzzers. The format-specific fuzzer can be used as a standalone producer or mutator in black-box settings.
Score: 11.201540907330436
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Effective fuzzing of programs that process structured binary inputs, such as multimedia files, is a challenging task, since those programs expect a very specific input format. Existing fuzzers, however, are mostly format-agnostic, which makes them versatile, but also ineffective when a specific format is required. We present FormatFuzzer, a generator for format-specific fuzzers. FormatFuzzer takes as input a binary template (a format specification used by the 010 Editor) and compiles it into C++ code that acts as a parser, a mutator, and a highly efficient generator of inputs conforming to the rules of the language. The resulting format-specific fuzzer can be used as a standalone producer or mutator in black-box settings, where no guidance from the program is available. In addition, by providing mutable decision seeds, it can be easily integrated with arbitrary format-agnostic fuzzers such as AFL to make them format-aware. In our evaluation on complex formats such as MP4 or ZIP, FormatFuzzer proved to be a highly effective producer of valid inputs and also detected previously unknown memory errors in ffmpeg and timidity.
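
The "mutable decision seeds" idea in the abstract can be illustrated with a minimal sketch. It is not FormatFuzzer's generated C++ code or its API. It assumes a made-up chunked format ("TOY1") and hypothetical names (DecisionSeed, make_chunk, generate, is_valid), and is written in Python only for brevity. Every choice the generator makes is read from a byte string, so a format-agnostic fuzzer could mutate those bytes freely while the output stays structurally valid.

```python
import struct
import zlib


class DecisionSeed:
    """Source of generator decisions, backed by a byte string (hypothetical helper).

    Once the bytes run out, every further decision reads as 0, so any
    byte string, including an empty one, is a usable seed.
    """

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def next_byte(self) -> int:
        value = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        return value

    def choose(self, options):
        return options[self.next_byte() % len(options)]


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Encode one chunk: type (4 bytes), length (u32 BE), payload, CRC-32 (u32 BE)."""
    crc = zlib.crc32(ctype + payload)
    return ctype + struct.pack(">I", len(payload)) + payload + struct.pack(">I", crc)


def generate(seed_bytes: bytes) -> bytes:
    """Produce a file in the toy format, driven entirely by the decision seed."""
    seed = DecisionSeed(seed_bytes)
    out = bytearray(b"TOY1")                      # fixed magic number
    for _ in range(seed.next_byte() % 4):         # decision: number of chunks
        ctype = seed.choose([b"DATA", b"TEXT", b"META"])  # decision: chunk type
        length = seed.next_byte() % 8             # decision: payload length
        payload = bytes(seed.next_byte() for _ in range(length))
        out += make_chunk(ctype, payload)         # length and CRC are derived, not chosen
    out += make_chunk(b"END ", b"")               # mandatory terminator chunk
    return bytes(out)


def is_valid(blob: bytes) -> bool:
    """Small parser for the toy format, used to check the generator's output."""
    if blob[:4] != b"TOY1":
        return False
    pos = 4
    while pos + 8 <= len(blob):
        ctype = blob[pos:pos + 4]
        (length,) = struct.unpack(">I", blob[pos + 4:pos + 8])
        end = pos + 8 + length
        if end + 4 > len(blob):
            return False
        (crc,) = struct.unpack(">I", blob[end:end + 4])
        if crc != zlib.crc32(ctype + blob[pos + 8:end]):
            return False
        pos = end + 4
        if ctype == b"END ":
            return pos == len(blob)
    return False
```

In this sketch, flipping any byte of the seed changes the number, type, or content of the chunks, but the length fields and checksums are always recomputed, which is why a mutated seed still yields a well-formed file.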

Related papers

Low-Cost and Comprehensive Non-textual Input Fuzzing with LLM-Synthesized Input Generators [25.199440800244442]
We present a novel approach to enabling grammar-aware fuzzing over non-textual inputs. LLMs are good at synthesizing and mutating input generators, which helps fuzzing escape local optima. G2FUZZ outperforms SOTA tools such as AFL++, Fuzztruction, and FormatFuzzer in terms of code coverage and bug finding.
arXiv Detail & Related papers (2025-01-31T16:45:16Z)
FuzzCoder: Byte-level Fuzzing Test via Large Language Model [46.18191648883695]
We propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks. FuzzCoder can predict mutation locations and strategies in input files to trigger abnormal behaviors of the program.
arXiv Detail & Related papers (2024-09-03T14:40:31Z)
Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration [82.88166538896331]
We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression. We test if CLMs understand the JPEG format by probing their capabilities to perform along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Results suggest that CLMs can understand the semantics of compressed data when directly operating on the byte streams of files produced by CFFs.
arXiv Detail & Related papers (2024-05-27T13:09:23Z)
3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers [5.102523342662388]
3DGen is a framework that makes use of AI agents to transform mixed informal input into format specifications in a language called 3D. 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C.
arXiv Detail & Related papers (2024-04-16T07:53:28Z)
Beyond Language Models: Byte Models are Digital World Simulators [68.91268999567473]
bGPT is a model with next byte prediction to simulate the digital world. It matches specialized models in performance across various modalities, including text, audio, and images. It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte.
arXiv Detail & Related papers (2024-02-29T13:38:07Z)
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
MambaByte: Token-free Selective State Space Model [71.90159903595514]
MambaByte is a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. We show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks.
arXiv Detail & Related papers (2024-01-24T18:53:53Z)
Augmenting Greybox Fuzzing with Generative AI [0.0]
We propose ChatFuzz, a greybox fuzzer augmented by generative AI. We conduct extensive experiments to explore the best practice for harnessing the power of generative LLMs. Experimental results show that our approach improves the edge coverage by 12.77% over the SOTA greybox fuzzer.
arXiv Detail & Related papers (2023-06-11T21:44:47Z)
Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
Toward the Detection of Polyglot Files [2.7402733069180996]
It is possible to abuse standardized file formats by creating a file that is valid in multiple file formats. The resulting polyglot (many languages) file can confound file format identification, allowing elements of the file to evade analysis. This is especially problematic for malware detection systems that rely on file format identification for feature extraction.
arXiv Detail & Related papers (2022-03-14T23:48:22Z)
Leader: Prefixing a Length for Faster Word Vector Serialization [11.112281331309939]
Two file formats are used to distribute pre-trained word embeddings. The GloVe format is a text based format that suffers from huge file sizes and slow reads. The word2vec format is a smaller binary format that mixes a textual representation of words with a binary representation of the vectors themselves.
arXiv Detail & Related papers (2020-09-29T00:25:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.