SoK: Prudent Evaluation Practices for Fuzzing
- URL: http://arxiv.org/abs/2405.10220v1
- Date: Thu, 16 May 2024 16:10:41 GMT
- Title: SoK: Prudent Evaluation Practices for Fuzzing
- Authors: Moritz Schloegel, Nils Bars, Nico Schiller, Lukas Bernhard, Tobias Scharnowski, Addison Crump, Arash Ale Ebrahim, Nicolai Bissantz, Marius Muench, Thorsten Holz
- Abstract summary: We systematically analyze the evaluation of 150 fuzzing papers published between 2018 and 2023.
We study how existing guidelines are implemented and observe potential shortcomings and pitfalls.
We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations, for example, when investigating reported bugs.
- Score: 21.113311952857778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fuzzing has proven to be a highly effective approach to uncovering software bugs over the past decade. After AFL popularized the groundbreaking concept of lightweight coverage feedback, the field of fuzzing has seen a vast amount of scientific work proposing new techniques, improving methodological aspects of existing strategies, or porting existing methods to new domains. All such work must demonstrate its merit by showing its applicability to a problem, measuring its performance, and often showing its superiority over existing works in a thorough, empirical evaluation. Yet, fuzzing is highly sensitive to its target, environment, and circumstances, e.g., randomness in the testing process. After all, relying on randomness is one of the core principles of fuzzing, governing many aspects of a fuzzer's behavior. Combined with an environment that is often hard to control, the reproducibility of experiments is a crucial concern and requires a prudent evaluation setup. To address these threats to validity, several works, most notably Evaluating Fuzz Testing by Klees et al., have outlined how a carefully designed evaluation setup should be implemented, but it remains unknown to what extent their recommendations have been adopted in practice. In this work, we systematically analyze the evaluation of 150 fuzzing papers published at the top venues between 2018 and 2023. We study how existing guidelines are implemented and observe potential shortcomings and pitfalls. We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations. For example, when investigating reported bugs, ...
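The guidelines by Klees et al. mentioned in the abstract recommend many repeated trials per fuzzer, a statistical test such as Mann-Whitney U, and an effect size such as Vargha-Delaney Â12. The following is a minimal Python sketch of such a comparison; the trial counts and coverage numbers are invented purely for illustration.

```python
# Minimal sketch of the statistical comparison recommended by Klees et al.:
# compare per-trial results (e.g., final branch coverage) of two fuzzers with a
# Mann-Whitney U test and report the Vargha-Delaney A12 effect size.
from scipy.stats import mannwhitneyu

# Final branch coverage from hypothetical independent 24h trials per fuzzer.
coverage_fuzzer_a = [1412, 1398, 1450, 1403, 1441, 1417, 1429, 1395, 1436, 1420]
coverage_fuzzer_b = [1377, 1391, 1360, 1402, 1385, 1371, 1398, 1366, 1389, 1380]

def vargha_delaney_a12(xs, ys):
    """Probability that a random trial of X beats a random trial of Y (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

stat, p_value = mannwhitneyu(coverage_fuzzer_a, coverage_fuzzer_b, alternative="two-sided")
a12 = vargha_delaney_a12(coverage_fuzzer_a, coverage_fuzzer_b)
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}, A12 = {a12:.2f}")
```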
Related papers
- A Comparative Quality Metric for Untargeted Fuzzing with Logic State Coverage [2.9914612342004503]
We propose logic state coverage as a proxy metric to count observed interesting behaviors.
A logic state distinguishes less repetitive (i.e., more interesting) behaviors at a finer granularity, making the amount of logic state coverage reliably proportional to the number of observed interesting behaviors.
arXiv Detail & Related papers (2024-09-23T13:08:17Z)
- Comment on Revisiting Neural Program Smoothing for Fuzzing [34.32355705821806]
MLFuzz, a work accepted at ACM FSE 2023, revisits the performance of a machine learning-based fuzzer, NEUZZ.
We demonstrate that its main conclusion is entirely wrong due to several fatal bugs in its implementation and flawed evaluation setups.
arXiv Detail & Related papers (2024-09-06T16:07:22Z)
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Testing the Consistency of Performance Scores Reported for Binary Classification Problems [0.0]
We introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup.
We demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
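The consistency tests themselves are released as an open-source Python package; the following sketch is not that package's API, but a simplified illustration of the idea: with known test-set sizes, reported sensitivity, specificity, and accuracy must be jointly reproducible by integer confusion-matrix entries within rounding tolerance.

```python
# Simplified sketch of a score-consistency check (not the paper's package or API):
# with p positives and n negatives, the reported sensitivity and specificity pin down
# integer TP and TN up to rounding; if no such pair also reproduces the reported
# accuracy, the three scores cannot all be correct.
def consistent(p, n, sens, spec, acc, decimals=4):
    eps = 0.5 * 10 ** (-decimals)  # maximal rounding error of a reported score
    tps = [tp for tp in range(p + 1) if abs(tp / p - sens) <= eps]
    tns = [tn for tn in range(n + 1) if abs(tn / n - spec) <= eps]
    return any(abs((tp + tn) / (p + n) - acc) <= eps for tp in tps for tn in tns)

# Hypothetical report on a test set with 100 positives and 150 negatives.
print(consistent(100, 150, sens=0.9500, spec=0.8800, acc=0.9080))  # True: TP=95, TN=132
print(consistent(100, 150, sens=0.9500, spec=0.8800, acc=0.9200))  # False: no integer TP/TN fits
```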
arXiv Detail & Related papers (2023-10-19T07:04:29Z)
- Too Good To Be True: performance overestimation in (re)current practices for Human Activity Recognition [49.1574468325115]
Sliding windows for data segmentation followed by standard random k-fold cross-validation produce biased results.
It is important to raise awareness in the scientific community about this problem, whose negative effects are being overlooked.
Several experiments with different types of datasets and different types of classification models allow us to exhibit the problem and show it persists independently of the method or dataset.
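The bias arises because overlapping sliding windows from the same recording can land in both training and test folds under random k-fold cross-validation; splitting by subject removes this leakage. Below is a minimal scikit-learn sketch on synthetic window metadata (subject counts, window sizes, and names are illustrative assumptions).

```python
# Sketch of why random k-fold over overlapping sliding windows leaks information,
# and how a grouped (per-subject) split avoids it. All data here is synthetic.
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

n_subjects, samples_per_subject = 5, 1000
window, stride = 128, 64  # 50% overlap between consecutive windows

starts = np.arange(0, samples_per_subject - window + 1, stride)
# One row per window: (subject id, start index within that subject's recording).
meta = np.array([(subj, start) for subj in range(n_subjects) for start in starts])
subjects = meta[:, 0]

def leaky_windows(train_idx, test_idx):
    """Count test windows sharing raw samples with a training window of the same subject."""
    train = meta[train_idx]
    count = 0
    for subj, start in meta[test_idx]:
        same_subject = train[train[:, 0] == subj]
        if np.any(np.abs(same_subject[:, 1] - start) < window):
            count += 1
    return count

for name, splitter in [("random KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("GroupKFold by subject", GroupKFold(n_splits=5))]:
    train_idx, test_idx = next(iter(splitter.split(meta, groups=subjects)))
    print(f"{name}: {leaky_windows(train_idx, test_idx)} of {len(test_idx)} test windows overlap the training data")
```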
arXiv Detail & Related papers (2023-10-18T13:24:05Z)
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions.
The finding that a simple softmax response baseline is the overall best-performing method underlines the drastic shortcomings of current evaluation practices.
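The softmax response baseline referred to above simply uses the maximum softmax probability as a confidence score, from which a risk-coverage trade-off follows by rejecting the least confident predictions. A minimal NumPy sketch on synthetic logits and labels (the numbers are meaningless; only the mechanics matter):

```python
# Minimal sketch of the maximum-softmax-response baseline for failure detection:
# take the highest softmax probability as confidence and measure the selective risk
# (error rate) among the most confident fraction of predictions. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_classes = 1000, 10
logits = rng.normal(size=(num_samples, num_classes))
labels = rng.integers(0, num_classes, size=num_samples)

probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

confidence = probs.max(axis=1)      # softmax response
predictions = probs.argmax(axis=1)
errors = (predictions != labels).astype(float)

order = np.argsort(-confidence)     # most confident predictions first
for coverage in (1.0, 0.8, 0.5):
    kept = order[: int(coverage * num_samples)]
    print(f"coverage {coverage:.0%}: selective risk {errors[kept].mean():.3f}")
```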
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible.
We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
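The sketch below is not the paper's test (which derives optimal weights for dependent within-episode rewards); it is a naive baseline for the same task, comparing recent per-episode returns against a reference period with a permutation test that shuffles whole episodes, so within-episode dependence is never broken. All numbers are synthetic.

```python
# Naive illustrative degradation check (not the paper's optimal test): permutation
# test on per-episode returns, asking whether the recent mean return has dropped
# relative to a reference period.
import numpy as np

rng = np.random.default_rng(1)
reference = rng.normal(loc=100.0, scale=10.0, size=200)  # past episode returns
recent = rng.normal(loc=92.0, scale=10.0, size=50)       # returns after a suspected degradation

observed_drop = reference.mean() - recent.mean()
pooled = np.concatenate([reference, recent])
perm_drops = np.empty(10_000)
for i in range(perm_drops.size):
    perm = rng.permutation(pooled)
    perm_drops[i] = perm[:reference.size].mean() - perm[reference.size:].mean()
p_value = float(np.mean(perm_drops >= observed_drop))
print(f"observed drop = {observed_drop:.2f}, permutation p = {p_value:.4f}")
```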
arXiv Detail & Related papers (2020-10-22T12:45:55Z)
- How Useful are Reviews for Recommendation? A Critical Review and Potential Improvements [8.471274313213092]
We investigate a growing body of work that seeks to improve recommender systems through the use of review text.
Our initial findings reveal several discrepancies in reported results, partly due to copying results across papers despite changes in experimental settings or data pre-processing.
Further investigation calls for discussion on a much larger problem about the "importance" of user reviews for recommendation.
arXiv Detail & Related papers (2020-05-25T16:30:05Z)
- Showing Your Work Doesn't Always Work [73.63200097493576]
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
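The estimator in question reports the expected validation score of the best of k hyperparameter assignments, computed from n observed runs. The sketch below contrasts the with-replacement weighting of the original estimator with a without-replacement order-statistic weighting in the spirit of the proposed correction; the exact formulas here are an assumption rather than taken verbatim from either paper, and the validation scores are made up.

```python
# Two ways to estimate the expected best-of-k validation score from n observed runs.
# Weighting (i/n)^k - ((i-1)/n)^k treats the k draws as sampled with replacement;
# C(i-1, k-1) / C(n, k) is the exact weight for sampling k runs without replacement.
from math import comb

def expected_max_with_replacement(scores, k):
    v = sorted(scores)
    n = len(v)
    return sum(v[i - 1] * ((i / n) ** k - ((i - 1) / n) ** k) for i in range(1, n + 1))

def expected_max_without_replacement(scores, k):
    v = sorted(scores)
    n = len(v)
    return sum(v[i - 1] * comb(i - 1, k - 1) / comb(n, k) for i in range(1, n + 1))

validation_scores = [0.71, 0.74, 0.69, 0.78, 0.73, 0.75, 0.70, 0.77]
for k in (1, 4, 8):
    print(k,
          round(expected_max_with_replacement(validation_scores, k), 4),
          round(expected_max_without_replacement(validation_scores, k), 4))
```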
arXiv Detail & Related papers (2020-04-28T17:59:01Z)