Hazards in Deep Learning Testing: Prevalence, Impact and Recommendations
- URL: http://arxiv.org/abs/2309.05381v1
- Date: Mon, 11 Sep 2023 11:05:34 GMT
- Title: Hazards in Deep Learning Testing: Prevalence, Impact and Recommendations
- Authors: Salah Ghamizi, Maxime Cordy, Yuejun Guo, Mike Papadakis, and Yves
Le Traon
- Abstract summary: We identify 10 commonly adopted empirical evaluation hazards that may significantly impact experimental results.
Our findings indicate that all 10 hazards have the potential to invalidate experimental findings.
We propose a set of 10 good empirical practices that can mitigate the impact of these hazards.
- Score: 17.824339932321788
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Much research on Machine Learning testing relies on empirical studies that
evaluate and show their potential. However, in this context empirical results
are sensitive to a number of parameters that can adversely impact the results
of the experiments and potentially lead to wrong conclusions (Type I errors,
i.e., incorrectly rejecting the null hypothesis). To this end, we survey the
related literature and identify 10 commonly adopted empirical evaluation
hazards that may significantly impact experimental results. We then perform a
sensitivity analysis on 30 influential studies that were published in top-tier
SE venues, against our hazard set and demonstrate their criticality. Our
findings indicate that all 10 hazards we identify have the potential to
invalidate experimental findings, such as those made by the related literature,
and should be handled properly. Going a step further, we propose a set of
10 good empirical practices that can mitigate the impact of these hazards. We
believe our work forms a first step towards raising awareness of the common
pitfalls and good practices within the software engineering community, and we
hope it contributes towards setting clear expectations for empirical research
in the field of deep learning testing.
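The Type I errors the abstract warns about can be made concrete with a small simulation. The sketch below is illustrative and not from the paper: it assumes one plausible hazard, cherry-picking random seeds, and shows how reporting a technique's best run against a competitor's average fabricates a difference even when both techniques are identical (i.e., the null hypothesis is true). All names (`run_experiment`, `true_rate`) are hypothetical.

```python
import random
import statistics

def run_experiment(seed, true_rate=0.7, n=100):
    """Simulate one evaluation run: n test cases, each detected with
    probability true_rate. Returns the observed detection rate."""
    rng = random.Random(seed)
    return sum(rng.random() < true_rate for _ in range(n)) / n

# Both "techniques" have the same true effectiveness, so any observed
# gap is pure noise (the null hypothesis holds by construction).
seeds = range(30)
scores_a = [run_experiment(s) for s in seeds]
scores_b = [run_experiment(s + 1000) for s in seeds]

# Fair comparison: average both techniques over all seeds.
fair_gap = statistics.mean(scores_a) - statistics.mean(scores_b)

# Hazardous comparison: report technique A's best seed against B's average,
# which systematically inflates the apparent advantage of A.
cherry_gap = max(scores_a) - statistics.mean(scores_b)

print(f"fair gap:   {fair_gap:+.3f}")
print(f"cherry gap: {cherry_gap:+.3f}")
```

Running this, the fair gap hovers near zero while the cherry-picked gap is several points in A's favor, illustrating why the paper's sensitivity analysis treats seed and configuration selection as a hazard to control.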
Related papers
- Contexts Matter: An Empirical Study on Contextual Influence in Fairness Testing for Deep Learning Systems [3.077531983369872]
We aim to understand how varying contexts affect fairness testing outcomes.
Our results show that different context types and settings generally lead to a significant impact on the testing.
arXiv Detail & Related papers (2024-08-12T12:36:06Z) - Design Principles for Falsifiable, Replicable and Reproducible Empirical ML Research [2.3265565167163906]
Empirical research plays a fundamental role in the machine learning domain.
We propose a model for the empirical research process, accompanied by guidelines to uphold the validity of empirical research.
arXiv Detail & Related papers (2024-05-28T11:37:59Z) - Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z) - On (Mis)perceptions of testing effectiveness: an empirical study [1.8026347864255505]
This research aims to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience.
In the original study, we conducted a controlled experiment with students applying two testing techniques and a code review technique.
At the end of the experiment, participants took a survey on which technique they perceived to be most effective.
The results of the replicated study confirm the findings of the original study and suggest that participants' perceptions may be based not on their opinions about technique complexity or their preferences, but on how well they think they applied the techniques.
arXiv Detail & Related papers (2024-02-11T14:50:01Z) - A Double Machine Learning Approach to Combining Experimental and Observational Data [59.29868677652324]
We propose a double machine learning approach to combine experimental and observational studies.
Our framework tests for violations of external validity and ignorability under milder assumptions.
arXiv Detail & Related papers (2023-07-04T02:53:11Z) - Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards
Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent.
Drawing correct inferences is especially challenging when (confounding) factors are not controlled experimentally.
We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z) - SurvITE: Learning Heterogeneous Treatment Effects from Time-to-Event
Data [83.50281440043241]
We study the problem of inferring heterogeneous treatment effects from time-to-event data.
We propose a novel deep learning method for treatment-specific hazard estimation based on balancing representations.
arXiv Detail & Related papers (2021-10-26T20:13:17Z) - An introduction to causal reasoning in health analytics [2.199093822766999]
We highlight some of the drawbacks that may arise in traditional machine learning and statistical approaches to analyzing observational data.
We will demonstrate the applications of causal inference in tackling some common machine learning issues.
arXiv Detail & Related papers (2021-05-10T20:25:56Z) - Enabling Counterfactual Survival Analysis with Balanced Representations [64.17342727357618]
Survival data are frequently encountered across diverse medical applications, e.g., drug development, risk profiling, and clinical trials.
We propose a theoretically grounded unified framework for counterfactual inference applicable to survival outcomes.
arXiv Detail & Related papers (2020-06-14T01:15:00Z) - A Survey on Causal Inference [64.45536158710014]
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics.
Various causal effect estimation methods for observational data have sprung up.
arXiv Detail & Related papers (2020-02-05T21:35:29Z)