Benchmarking Deep Learning Fuzzers
- URL: http://arxiv.org/abs/2310.06912v1
- Date: Tue, 10 Oct 2023 18:09:16 GMT
- Title: Benchmarking Deep Learning Fuzzers
- Authors: Nima Shiri Harzevili, Hung Viet Pham, Song Wang
- Abstract summary: We run three state-of-the-art DL fuzzers, FreeFuzz, DeepRel, and DocTer, on the benchmark by following their instructions.
We find that these fuzzers are unable to detect many real bugs collected in our benchmark dataset.
Our systematic analysis further identifies four major, broad, and common factors that affect these fuzzers' ability to detect real bugs.
- Score: 11.118370064698869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we set out to conduct the first ground-truth empirical
evaluation of state-of-the-art DL fuzzers. Specifically, we first manually
created an extensive DL bug benchmark dataset, which includes 627 real-world DL
bugs from TensorFlow and PyTorch libraries reported by users between 2020 and
2022. Then we ran three state-of-the-art DL fuzzers, i.e., FreeFuzz, DeepRel,
and DocTer, on the benchmark by following their instructions. We find that
these fuzzers are unable to detect many real bugs collected in our benchmark
dataset. Specifically, most (235) of the 257 applicable bugs cannot be detected
by any fuzzer.
Our systematic analysis further identifies four major, broad, and common
factors that affect these fuzzers' ability to detect real bugs. These findings
present opportunities to improve the performance of the fuzzers in future work.
As a proof of concept, we propose a lightweight corner case generator as an
extension to the three DL fuzzers, which simply covers several boundary values
as well as DL-specific data types. It helps FreeFuzz, DeepRel, and DocTer
detect 12, 12, and 14 more bugs, respectively, that were overlooked by the
original fuzzers. Overall, this work complements prior studies on DL fuzzers
with an extensive performance evaluation and provides a benchmark for future DL
library fuzzing studies. Also, our proposed corner case generator demonstrates
that the fuzzers can detect more bugs when their internal fuzzing logic is
extended based on the insights from our root cause analysis.
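The abstract does not spell out how the lightweight corner case generator works beyond "several boundary values as well as DL-specific data types." As a rough, hypothetical illustration (none of the names below come from the paper), such a generator might simply enumerate combinations of numeric boundary values and DL dtypes for each fuzzed API argument:

```python
# Hypothetical sketch of a lightweight corner case generator in the spirit of
# the paper's extension: it enumerates boundary values and DL-specific dtypes
# for each fuzzed API argument. All names are illustrative assumptions.
import itertools

# Boundary values commonly implicated in numeric-argument bugs.
BOUNDARY_VALUES = [0, -1, 1, 2**31 - 1, -(2**31),
                   float("inf"), float("-inf"), float("nan")]

# DL-specific data types, kept as strings to stay framework-agnostic.
DL_DTYPES = ["float16", "bfloat16", "float32", "int8", "uint8", "bool", "complex64"]

def corner_cases(num_args, max_cases=50):
    """Yield up to max_cases argument tuples mixing boundary values and dtypes."""
    pool = BOUNDARY_VALUES + DL_DTYPES
    for i, combo in enumerate(itertools.product(pool, repeat=num_args)):
        if i >= max_cases:
            break
        yield combo

# Example: candidate inputs for a hypothetical two-argument API under fuzz.
cases = list(corner_cases(num_args=2, max_cases=10))
```

Each generated tuple would then be fed through the host fuzzer's existing invocation machinery, which is presumably how the extension stays lightweight.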
Related papers
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- FuzzSlice: Pruning False Positives in Static Analysis Warnings Through Function-Level Fuzzing [5.748423489074936]
We propose FuzzSlice, a framework that automatically prunes possible false positives among static analysis warnings.
The key insight that we base our work on is that a warning that does not yield a crash when fuzzed at the function level in a given time budget is a possible false positive.
FuzzSlice reduces false positives by 62.26% in the open-source repositories and by 100% in the Juliet dataset.
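FuzzSlice's key insight above can be sketched in a few lines. This is a toy illustration of the idea, not the actual tool: `fuzz_function` is a stand-in for real function-level fuzzing, and the time budget is deliberately tiny.

```python
# Toy sketch of FuzzSlice's insight: a warning whose enclosing function does
# not crash under function-level fuzzing within a time budget is treated as a
# possible false positive. `fuzz_function` is an illustrative stand-in.
import time
import random

def fuzz_function(func, budget_seconds):
    """Call func on random ints until it crashes or the budget expires."""
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        try:
            func(random.randint(-10**6, 10**6))
        except Exception:
            return True   # crash reproduced -> warning likely a true positive
    return False          # no crash within budget -> possible false positive

def prune_warnings(warnings, budget_seconds=0.05):
    """Keep only warnings whose function crashed under fuzzing."""
    return [w for w in warnings if fuzz_function(w["function"], budget_seconds)]

# Two toy "warned" functions: one genuinely buggy, one safe.
def buggy(x):
    return 1 // (x % 2)   # ZeroDivisionError whenever x is even

def safe(x):
    return abs(x)         # never crashes

kept = prune_warnings([{"id": 1, "function": buggy}, {"id": 2, "function": safe}])
```

In the real system the budget, harness generation, and compilation of the function slice are the hard parts; the pruning rule itself is this simple.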
arXiv Detail & Related papers (2024-02-02T21:49:24Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Prompt Fuzzing for Fuzz Driver Generation [6.238058387665971]
We propose PromptFuzz, a coverage-guided fuzzer for prompt fuzzing.
It iteratively generates fuzz drivers to explore undiscovered library code.
PromptFuzz achieved 1.61 and 1.63 times higher branch coverage than OSS-Fuzz and Hopper, respectively.
arXiv Detail & Related papers (2023-12-29T16:43:51Z)
- HOPPER: Interpretative Fuzzing for Libraries [6.36596812288503]
HOPPER can fuzz libraries without requiring any domain knowledge.
It transforms the problem of library fuzzing into the problem of interpreter fuzzing.
arXiv Detail & Related papers (2023-09-07T06:11:18Z)
- What Happens When We Fuzz? Investigating OSS-Fuzz Bug History [0.9772968596463595]
We analyzed 44,102 reported issues made public by OSS-Fuzz prior to March 12, 2022.
We identified the bug-contributing commits to estimate when the bug-containing code was introduced, and measured the timeline from introduction to detection to fix.
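The timeline measurement described above reduces to simple date arithmetic once the introducing, detecting, and fixing commits are dated. A minimal sketch (the dates are made up for illustration):

```python
# Minimal sketch of the introduction -> detection -> fix timeline measurement.
from datetime import date

def bug_timeline(introduced, detected, fixed):
    """Return (days from introduction to detection, days from detection to fix)."""
    return ((detected - introduced).days, (fixed - detected).days)

# Hypothetical bug: introduced Jan 1, detected Jun 1, fixed Jun 10, 2021.
timeline = bug_timeline(date(2021, 1, 1), date(2021, 6, 1), date(2021, 6, 10))
```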
arXiv Detail & Related papers (2023-05-19T05:15:36Z)
- Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model.
We propose to embed external patterns via backdoor watermarking to enable ownership verification and protect the released datasets.
Specifically, we exploit poison-only backdoor attacks (e.g., BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
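A hypothesis-test-guided verification of the kind described above might look as follows. This is a hedged sketch under assumed details, not the paper's code: if a suspicious model predicts the watermark's target label on trigger-stamped inputs significantly more often than chance, the protected dataset was likely used for training.

```python
# Illustrative sketch of hypothesis-test-guided dataset ownership verification.
# Names, thresholds, and the exact test are assumptions, not the paper's method.
from math import comb

def binom_sf(k, n, p):
    """Survival function P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def verify_ownership(predictions, target_label, num_classes, alpha=0.01):
    """Reject the null 'no watermark learned' when hits exceed chance at level alpha."""
    hits = sum(1 for y in predictions if y == target_label)
    p_value = binom_sf(hits, len(predictions), 1.0 / num_classes)
    return p_value < alpha

# Suspicious model: predicts target label 3 on 45 of 50 trigger-stamped inputs.
watermarked_preds = [3] * 45 + [0] * 5
# Clean model: predictions spread uniformly over 10 classes.
clean_preds = [i % 10 for i in range(50)]

owned = verify_ownership(watermarked_preds, target_label=3, num_classes=10)
not_owned = verify_ownership(clean_preds, target_label=3, num_classes=10)
```

The appeal of a hypothesis test here is that it turns a fuzzy "the model behaves oddly on my triggers" observation into a quantifiable false-accusation rate.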
arXiv Detail & Related papers (2022-08-04T05:32:20Z)
- DeFuzz: Deep Learning Guided Directed Fuzzing [41.61500799890691]
We propose a deep learning (DL) guided directed fuzzing for software vulnerability detection, named DeFuzz.
DeFuzz includes two main schemes: (1) we employ a pre-trained DL prediction model to identify the potentially vulnerable functions and the locations (i.e., vulnerable addresses)
Precisely, we employ Bidirectional-LSTM (BiLSTM) to identify attention words, and the vulnerabilities are associated with these attention words in functions.
arXiv Detail & Related papers (2020-10-23T03:44:03Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.