Related papers: Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems

Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems

URL: http://arxiv.org/abs/2509.19419v1
Date: Tue, 23 Sep 2025 16:16:02 GMT
Title: Probabilistic Runtime Verification, Evaluation and Risk Assessment of Visual Deep Learning Systems
Authors: Birk Torpmann-Hagen, Pål Halvorsen, Michael A. Riegler, Dag Johansen,
Abstract summary: We propose a novel methodology for the verification, evaluation, and risk assessment of deep learning systems.<n>Our approach explicitly models the incidence of distributional shifts at runtime by estimating their probability from outputs of out-of-distribution detectors.<n>Our approach consistently outperforms conventional evaluation, with accuracy estimation errors typically ranging between 0.01 and 0.1.
Score: 3.9341402479278216
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite achieving excellent performance on benchmarks, deep neural networks often underperform in real-world deployment due to sensitivity to minor, often imperceptible shifts in input data, known as distributional shifts. These shifts are common in practical scenarios but are rarely accounted for during evaluation, leading to inflated performance metrics. To address this gap, we propose a novel methodology for the verification, evaluation, and risk assessment of deep learning systems. Our approach explicitly models the incidence of distributional shifts at runtime by estimating their probability from outputs of out-of-distribution detectors. We combine these estimates with conditional probabilities of network correctness, structuring them in a binary tree. By traversing this tree, we can compute credible and precise estimates of network accuracy. We assess our approach on five different datasets, with which we simulate deployment conditions characterized by differing frequencies of distributional shift. Our approach consistently outperforms conventional evaluation, with accuracy estimation errors typically ranging between 0.01 and 0.1. We further showcase the potential of our approach on a medical segmentation benchmark, wherein we apply our methods towards risk assessment by associating costs with tree nodes, informing cost-benefit analyses and value-judgments. Ultimately, our approach offers a robust framework for improving the reliability and trustworthiness of deep learning systems, particularly in safety-critical applications, by providing more accurate performance estimates and actionable risk assessments.

Related papers

Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience.<n>AURORA is complemented by a set of metrics designed to go beyond point-in-time performance.<n>The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
Evaluating Deep Neural Networks in Deployment (A Comparative and Replicability Study) [11.242083685224554]
Deep neural networks (DNNs) are increasingly used in safety-critical applications. We study recent approaches that have been proposed to evaluate the reliability of DNNs in deployment. We find that it is hard to run and reproduce the results for these approaches on their replication packages and even more difficult to run them on artifacts other than their own.
arXiv Detail & Related papers (2024-07-11T17:58:12Z)
Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I [39.92942310783174]
Large language models (LLMs) can generate relevance annotations at an enormous scale with relatively small computational costs. We propose two methods based on prediction-powered inference and conformal risk control. Our experimental results show that our CIs accurately capture both the variance and bias in evaluation.
arXiv Detail & Related papers (2024-07-02T17:44:00Z)
Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts [73.33395097728128]
We provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions.
arXiv Detail & Related papers (2023-04-19T17:38:42Z)
Free Lunch for Generating Effective Outlier Supervision [46.37464572099351]
We propose an ultra-effective method to generate near-realistic outlier supervision. Our proposed textttBayesAug significantly reduces the false positive rate over 12.50% compared with the previous schemes.
arXiv Detail & Related papers (2023-01-17T01:46:45Z)
Benchmarking common uncertainty estimation methods with histopathological images under domain shift and label noise [62.997667081978825]
In high-risk environments, deep learning models need to be able to judge their uncertainty and reject inputs when there is a significant chance of misclassification. We conduct a rigorous evaluation of the most commonly used uncertainty and robustness methods for the classification of Whole Slide Images. We observe that ensembles of methods generally lead to better uncertainty estimates as well as an increased robustness towards domain shifts and label noise.
arXiv Detail & Related papers (2023-01-03T11:34:36Z)
On the Practicality of Deterministic Epistemic Uncertainty [106.06571981780591]
deterministic uncertainty methods (DUMs) achieve strong performance on detecting out-of-distribution data. It remains unclear whether DUMs are well calibrated and can seamlessly scale to real-world applications.
arXiv Detail & Related papers (2021-07-01T17:59:07Z)
Quantifying Uncertainty in Deep Spatiotemporal Forecasting [67.77102283276409]
We describe two types of forecasting problems: regular grid-based and graph-based. We analyze UQ methods from both the Bayesian and the frequentist point view, casting in a unified framework via statistical decision theory. Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical computational trade-offs for different UQ methods.
arXiv Detail & Related papers (2021-05-25T14:35:46Z)
Learning Calibrated Uncertainties for Domain Shift: A Distributionally Robust Learning Approach [150.8920602230832]
We propose a framework for learning calibrated uncertainties under domain shifts. In particular, the density ratio estimation reflects the closeness of a target (test) sample to the source (training) distribution. We show that our proposed method generates calibrated uncertainties that benefit downstream tasks.
arXiv Detail & Related papers (2020-10-08T02:10:54Z)
Data-Driven Assessment of Deep Neural Networks with Random Input Uncertainty [14.191310794366075]
We develop a data-driven optimization-based method capable of simultaneously certifying the safety of network outputs and localizing them. We experimentally demonstrate the efficacy and tractability of the method on a deep ReLU network.
arXiv Detail & Related papers (2020-10-02T19:13:35Z)
Uncertainty-Based Out-of-Distribution Classification in Deep Reinforcement Learning [17.10036674236381]
Wrong predictions for out-of-distribution data can cause safety critical situations in machine learning systems. We propose a framework for uncertainty-based OOD classification: UBOOD. We show that UBOOD produces reliable classification results when combined with ensemble-based estimators.
arXiv Detail & Related papers (2019-12-31T09:52:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.