Policy Testing with MDPFuzz (Replicability Study)
- URL: http://arxiv.org/abs/2502.19116v1
- Date: Wed, 26 Feb 2025 13:11:52 GMT
- Title: Policy Testing with MDPFuzz (Replicability Study)
- Authors: Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, Mathieu Acher
- Abstract summary: We verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We find that, in most cases, the ablated fuzzer (MDPFuzz without its coverage guidance) outperforms MDPFuzz, and conclude that the proposed coverage model does not lead to finding more faults.
- Score: 13.133263651395865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, following tremendous achievements in Reinforcement Learning, a great deal of interest has been devoted to ML models for sequential decision-making. Alongside these scientific advances, research has been conducted to develop automated functional testing methods for finding faults in black-box models solving Markov decision processes. Pang et al. (ISSTA 2022) presented a black-box fuzz testing framework called MDPFuzz. The method consists of a fuzzer whose main feature is to use Gaussian Mixture Models (GMMs) to compute the coverage of a test input as the likelihood of having already observed its result. This coverage-based guidance aims at favoring novelty during testing, and thereby fault discovery in the decision model. Pang et al. evaluated their work on four use cases by comparing the number of failures found after twelve-hour testing campaigns with and without the guidance of the GMMs (an ablation study). In this paper, we verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We re-implemented the proposed methodology and evaluated our replication in a large-scale study that extends the original four use cases with three new ones. Furthermore, we compare MDPFuzz and its ablated counterpart with a random testing baseline. We also assess the effectiveness of coverage guidance for different parameters, something that was not done in the original evaluation. Despite this parameter analysis and contrary to Pang et al.'s original conclusions, we find that in most cases the ablated fuzzer outperforms MDPFuzz, and conclude that the proposed coverage model does not lead to finding more faults.
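To make the coverage idea concrete, here is a minimal sketch of GMM-based novelty guidance, assuming scikit-learn's GaussianMixture; the component count, novelty threshold, and feature representation are illustrative placeholders, not MDPFuzz's actual implementation.

```python
# Illustrative sketch of GMM-based coverage guidance (not MDPFuzz's code):
# coverage of a test input is the likelihood of having already observed
# the states it reaches; low likelihood = novel, so keep it for mutation.
import numpy as np
from sklearn.mixture import GaussianMixture

class CoverageModel:
    def __init__(self, n_components=10, novelty_threshold=-50.0):
        self.gmm = GaussianMixture(n_components=n_components)
        self.history = []                    # feature vectors of past executions
        self.threshold = novelty_threshold   # placeholder value

    def update(self, features):
        """Record an executed test's state features and refit the model."""
        self.history.append(features)
        if len(self.history) > self.gmm.n_components:
            self.gmm.fit(np.asarray(self.history))

    def is_novel(self, features):
        """True if the result looks unlike anything observed so far."""
        if len(self.history) <= self.gmm.n_components:
            return True                      # too little data to judge yet
        log_lik = self.gmm.score_samples(np.asarray(features).reshape(1, -1))[0]
        return log_lik < self.threshold
```

In a fuzzing loop, inputs whose executions score as novel under this model would be retained in the corpus for further mutation.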
Related papers
- Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework [77.45983464131977]
We focus on how likely it is that a RAG model's prediction is incorrect, resulting in uncontrollable risks in real-world applications. Our research identifies two critical latent factors affecting RAG's confidence in its predictions. We develop a counterfactual prompting framework that induces the models to alter these factors and analyzes the effect on their answers.
arXiv Detail & Related papers (2024-09-24T14:52:14Z)
- Two new feature selection methods based on learn-heuristic techniques for breast cancer prediction: A comprehensive analysis [6.796017024594715]
We suggest two novel feature selection (FS) methods based on an imperialist competitive algorithm (ICA) and a bat algorithm (BA).
This study aims to enhance diagnostic models' efficiency and present a comprehensive analysis to help clinical physicians make much more precise and reliable decisions than before.
arXiv Detail & Related papers (2024-07-19T19:07:53Z)
- Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling [82.36856860383291]
We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network.
We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery method.
arXiv Detail & Related papers (2023-10-27T13:09:56Z)
- A Semi-Bayesian Nonparametric Estimator of the Maximum Mean Discrepancy Measure: Applications in Goodness-of-Fit Testing and Generative Adversarial Networks [3.623570119514559]
We propose a semi-Bayesian nonparametric (semi-BNP) procedure for the goodness-of-fit (GOF) test.
Our method introduces a novel Bayesian estimator for the maximum mean discrepancy (MMD) measure.
We demonstrate that our proposed test outperforms frequentist MMD-based methods by achieving a lower false rejection and acceptance rate of the null hypothesis.
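For orientation, the quantity being estimated can be sketched with the standard frequentist unbiased empirical MMD^2 under an RBF kernel, the kind of statistic the semi-BNP procedure targets; this baseline sketch is not the paper's Bayesian estimator.

```python
# Frequentist baseline: unbiased empirical MMD^2 with an RBF kernel,
# the statistic for which the paper develops a Bayesian estimator.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    sq_dists = (np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-gamma * sq_dists)

def mmd2_unbiased(x, y, gamma=1.0):
    """Unbiased estimate of MMD^2 between samples x and y (2-D arrays)."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, gamma); np.fill_diagonal(kxx, 0.0)
    kyy = rbf_kernel(y, y, gamma); np.fill_diagonal(kyy, 0.0)
    kxy = rbf_kernel(x, y, gamma)
    return (kxx.sum() / (m * (m - 1))
            + kyy.sum() / (n * (n - 1))
            - 2.0 * kxy.mean())
```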
arXiv Detail & Related papers (2023-03-05T10:36:21Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Machine Learning Testing in an ADAS Case Study Using Simulation-Integrated Bio-Inspired Search-Based Testing [7.5828169434922]
Deeper generates failure-revealing test scenarios for testing a deep neural network-based lane-keeping system.
In the newly proposed version, we utilize a new set of bio-inspired search algorithms: genetic algorithm (GA), $(\mu+\lambda)$ and $(\mu,\lambda)$ evolution strategies (ES), and particle swarm optimization (PSO).
Our evaluation shows the newly proposed test generators in Deeper represent a considerable improvement on the previous version.
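To unpack the notation above: an ES keeps $\mu$ parents that generate $\lambda$ offspring per generation; a "plus" strategy selects survivors from parents and offspring together, while a "comma" strategy selects from the offspring only. A generic sketch follows, with a placeholder fitness function rather than Deeper's actual test-scenario objective.

```python
# Generic (mu + lambda) evolution strategy; the "comma" variant would
# select from `offspring` alone. Fitness and parameters are placeholders,
# not Deeper's actual test-generation objective.
import numpy as np

def es_plus(fitness, dim, mu=5, lam=20, sigma=0.1, generations=100):
    parents = np.random.randn(mu, dim)
    for _ in range(generations):
        # each offspring is a Gaussian mutation of a random parent
        idx = np.random.randint(mu, size=lam)
        offspring = parents[idx] + sigma * np.random.randn(lam, dim)
        pool = np.vstack([parents, offspring])      # "+": parents compete too
        scores = np.array([fitness(ind) for ind in pool])
        parents = pool[np.argsort(scores)[-mu:]]    # keep the mu fittest
    return max(parents, key=fitness)
```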
arXiv Detail & Related papers (2022-03-22T20:27:40Z)
- MDPFuzz: Testing Models Solving Markov Decision Processes [10.53962813929928]
We present MDPFuzz, the first black-box fuzz testing framework for models solving Markov decision processes (MDPs).
MDPFuzz forms testing oracles by checking whether the target model enters abnormal and dangerous states.
We report the notable finding that crash-triggering states, though they look normal, induce distinct neuron activation patterns compared with normal states.
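As a rough illustration of such a testing oracle: execute the policy and flag a fault whenever a visited state is abnormal or dangerous. The gym-style environment API and the danger predicate below are illustrative assumptions, not MDPFuzz's code.

```python
# Sketch of the oracle idea: run the policy and report a fault when a
# visited state is abnormal or dangerous. The gym-style API and the
# `is_dangerous` predicate are illustrative assumptions.
def is_dangerous(obs, info):
    # placeholder; real oracles check e.g. collisions or constraint violations
    return info.get("crash", False)

def run_oracle(env, policy, initial_state, max_steps=1000):
    obs = env.reset(initial_state)      # assumes a resettable initial state
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        if is_dangerous(obs, info):
            return True                 # fault found for this test input
        if done:
            break
    return False
```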
arXiv Detail & Related papers (2021-12-06T06:35:55Z)
- The MultiBERTs: BERT Reproductions for Robustness Analysis [86.29162676103385]
Re-running pretraining can lead to substantially different conclusions about performance.
We introduce MultiBERTs: a set of 25 BERT-base checkpoints.
The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures.
arXiv Detail & Related papers (2021-06-30T15:56:44Z)
- What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness.
We validated the performance of trained models using datasets with other flaky tests and from different projects.
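A minimal sketch of the vocabulary-based approach being replicated: represent each test by its identifier tokens and fit an off-the-shelf classifier. The toy data and model choice below are illustrative, not the study's exact setup.

```python
# Toy sketch of identifier-vocabulary flakiness prediction; the data and
# model choice are illustrative, not the study's exact setup.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# each sample is the identifier tokens of one test; 1 = flaky, 0 = stable
tests = ["assert response status retry sleep timeout",
         "add item assert cart total equals"]
labels = [1, 0]

model = make_pipeline(CountVectorizer(), RandomForestClassifier())
model.fit(tests, labels)
print(model.predict(["poll queue wait retry"]))   # vocabulary-based guess
```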
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
- Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
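Dorfman's observation can be checked with a short computation: pool g samples, test the pool once, and retest individually only when the pool is positive; at low prevalence the expected number of tests per person falls well below one.

```python
# Expected tests per person under Dorfman two-stage pooling: one pool
# test, plus g individual retests whenever the pool is positive.
def dorfman_tests_per_person(prevalence, group_size):
    p_pool_positive = 1.0 - (1.0 - prevalence) ** group_size
    expected_tests = 1.0 + group_size * p_pool_positive
    return expected_tests / group_size

# At 1% prevalence, pools of 10 need about 0.20 tests per person,
# versus 1.0 for individual testing.
print(dorfman_tests_per_person(0.01, 10))  # ~0.1956
```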
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.