Augmenting Greybox Fuzzing with Generative AI
- URL: http://arxiv.org/abs/2306.06782v1
- Date: Sun, 11 Jun 2023 21:44:47 GMT
- Title: Augmenting Greybox Fuzzing with Generative AI
- Authors: Jie Hu (University of California Riverside), Qian Zhang (University of
California Riverside), Heng Yin (University of California Riverside)
- Abstract summary: We propose ChatFuzz, a greybox fuzzer augmented by generative AI.
We conduct extensive experiments to explore the best practice for harvesting the power of generative LLM models.
Experiment results show that our approach improves the edge coverage by 12.77% over the SOTA greybox fuzzer.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-world programs expecting structured inputs often have a
format-parsing stage gating the deeper program space. Neither a mutation-based
approach nor a generative approach can provide a solution that is both
effective and scalable. Large language models (LLMs) pre-trained on enormous
natural-language corpora have proved effective at understanding implicit
format syntax and generating format-conforming inputs. In this paper, we
propose ChatFuzz, a greybox fuzzer augmented by generative AI. More
specifically, we pick a seed from the fuzzer's seed pool and prompt ChatGPT
generative models to produce variations, which are more likely to be
format-conforming and thus of high quality. We conduct extensive experiments
to explore the best practice for harvesting the power of generative LLM
models. The experiment results show that our approach improves edge coverage
by 12.77% over the SOTA greybox fuzzer (AFL++) on 12 target programs from
three well-tested benchmarks. As for vulnerability detection, ChatFuzz
performs similarly to or better than AFL++ for programs with explicit syntax
rules, but not for programs with non-trivial syntax.
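The pipeline the abstract describes (pick a seed, ask a generative model for format-conforming variations, keep the ones that reach new coverage) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_generate` and `coverage_of` are hypothetical injected callables standing in for a ChatGPT API call and for the instrumented target's edge coverage, and the prompt wording is invented.

```python
import random

def make_prompt(seed: str) -> str:
    # Hypothetical prompt wording; the paper's actual prompts are not shown here.
    return f"Generate a variation of the following input, keeping its format:\n{seed}"

def chatfuzz_round(seed_pool, llm_generate, coverage_of, n_variants=3):
    """One greybox-fuzzing round augmented with LLM-generated mutants.

    llm_generate(prompt, n) -> list[str] is a stand-in for a ChatGPT call;
    coverage_of(input) -> set is a stand-in for the target's edge coverage.
    """
    seed = random.choice(seed_pool)
    variants = llm_generate(make_prompt(seed), n=n_variants)
    baseline = coverage_of(seed)
    kept = []
    for v in variants:
        # Keep mutants that reach edges the seed did not,
        # mirroring the coverage-guided feedback loop of AFL++.
        if coverage_of(v) - baseline:
            kept.append(v)
    seed_pool.extend(kept)
    return kept
```

Injecting the model and coverage oracle as callables keeps the loop testable without a live LLM or an instrumented binary.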
Related papers
- Generator-Based Fuzzers with Type-Based Targeted Mutation [1.4507298892594764]
In previous work, coverage-guided fuzzers used a mix of static analysis, taint analysis, and constraint-solving approaches to address this problem.
In this paper, we introduce a type-based mutation, along with constant string lookup, for Java GBF.
Results compared to a baseline GBF tool show an almost 20% average improvement in application coverage, and larger improvements when third-party code is included.
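A type-based mutation, as this summary describes it, chooses a mutator by the runtime type of each value rather than flipping raw bytes, and a constant-string lookup occasionally substitutes strings the target is known to compare against. A minimal Python sketch of the idea (the dispatch rules and the `CONSTANT_STRINGS` set are illustrative, not the tool's actual design, which targets Java):

```python
import random

# Illustrative stand-in for strings mined from the target's comparisons.
CONSTANT_STRINGS = ["admin", "true", "<xml>"]

def mutate(value):
    """Dispatch on the value's type instead of mutating raw bytes."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return not value
    if isinstance(value, int):
        return value + random.choice([-1, 1, 128])
    if isinstance(value, str):
        # Constant-string lookup: sometimes swap in a known comparison target.
        return random.choice(CONSTANT_STRINGS) if random.random() < 0.5 else value + "A"
    if isinstance(value, list):
        return [mutate(v) for v in value]
    return value
```

Because each mutator preserves the value's type, mutated inputs stay structurally valid and are less likely to die in the format-parsing stage.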
arXiv Detail & Related papers (2024-06-04T07:20:13Z)
- Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming [8.34623776815378]
We curate a dataset of 600K lines of open-source F* programs and proofs.
Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem.
We investigate the use of AI to synthesize programs and their proofs in F*, with promising results.
arXiv Detail & Related papers (2024-05-03T00:14:33Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE)
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation [25.474639218436916]
We use a language-model-infused scaffolding program to improve itself.
A variety of self-improvement strategies are proposed by the language model.
It demonstrates that a modern language model, GPT-4, is capable of writing code that can call itself to improve itself.
arXiv Detail & Related papers (2023-10-03T17:59:32Z)
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers [84.9120606803906]
Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback.
In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner.
We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
arXiv Detail & Related papers (2023-02-10T12:16:38Z)
- Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with their root words.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word-transformation type.
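The first step described above can be illustrated with a toy preprocessor. The suffix-stripping "lemmatizer" below is a deliberately crude stand-in for real morphological analysis, and the recorded suffixes play the role of the transformation types the decoder would emit alongside copy actions; none of this is the paper's actual model.

```python
# Toy suffix stripping; a real system would use proper morphological analysis.
SUFFIXES = ("ing", "ed", "es", "s")

def to_root(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess_encoder_input(tokens):
    """Feed root forms to the encoder and remember which suffix would restore
    each surface form, so the decoder can emit (copy + transformation-type)
    actions instead of rare inflected tokens."""
    roots, transforms = [], []
    for tok in tokens:
        root = to_root(tok)
        roots.append(root)
        transforms.append(tok[len(root):] or None)  # the stripped suffix, if any
    return roots, transforms
```

Replacing inflected forms with roots shrinks the effective vocabulary the model must generate from, which is the redundancy the title refers to.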
arXiv Detail & Related papers (2023-01-01T13:08:11Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Natural Language to Code Translation with Execution [82.52142893010563]
We propose execution result-based minimum Bayes risk decoding for program selection.
We show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks.
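Execution-based minimum Bayes risk decoding, as summarized above, can be sketched in a few lines: run every sampled candidate program and select the one whose execution result agrees with the largest share of the samples. The `execute` callable is an assumed stand-in for running a candidate in a sandbox; this is a simplified exact-match version of the idea, not the paper's exact procedure.

```python
from collections import Counter

def mbr_select(candidates, execute):
    """Pick the candidate program whose execution result is modal among
    all samples, i.e. the minimum-risk choice under an exact-match loss."""
    results = [execute(c) for c in candidates]
    # A candidate's risk is the number of samples whose result differs from
    # its own, so minimizing risk means maximizing agreement.
    counts = Counter(results)
    best = max(range(len(candidates)), key=lambda i: counts[results[i]])
    return candidates[best]
```

With candidates `["1+1", "2*1", "3"]` executed via `eval`, two of three samples evaluate to 2, so an MBR selection returns one of the first two candidates.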
arXiv Detail & Related papers (2022-04-25T06:06:08Z)
- Deep Continuous Prompt for Contrastive Learning of Sentence Embeddings [8.70715711885114]
We present a novel method which freezes the whole language model and only optimizes the prefix deep continuous prompts.
It tunes only around 0.1% of the original language model's parameters and avoids the cumbersome computation of searching for handcrafted prompts.
Our proposed DCPCSE outperforms the state-of-the-art method SimCSE by a large margin.
arXiv Detail & Related papers (2022-03-14T06:07:44Z)
- Imputer: Sequence Modelling via Imputation and Dynamic Programming [101.5705527605346]
Imputer is an iterative generative model, requiring only a constant number of generation steps independent of the number of input or output tokens.
We present a tractable dynamic programming training algorithm, which yields a lower bound on the log marginal likelihood.
arXiv Detail & Related papers (2020-02-20T18:21:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.