A Call to Reflect on Evaluation Practices for Failure Detection in Image
Classification
- URL: http://arxiv.org/abs/2211.15259v2
- Date: Wed, 5 Apr 2023 08:39:40 GMT
- Title: A Call to Reflect on Evaluation Practices for Failure Detection in Image
Classification
- Authors: Paul F. Jaeger, Carsten T. Lüth, Lukas Klein and Till J. Bungert
- Abstract summary: We present a large-scale empirical study that for the first time enables benchmarking of confidence scoring functions with respect to all relevant methods and failure sources.
The finding that a simple softmax response baseline is the overall best-performing method underlines the drastic shortcomings of current evaluation.
- Score: 0.491574468325115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable application of machine learning-based decision systems in the wild
is one of the major challenges currently investigated by the field. A large
portion of established approaches aims to detect erroneous predictions by means
of assigning confidence scores. This confidence may be obtained by either
quantifying the model's predictive uncertainty, learning explicit scoring
functions, or assessing whether the input is in line with the training
distribution. Curiously, while these approaches all claim to address the same
eventual goal of detecting failures of a classifier upon real-life application,
they currently constitute largely separated research fields with individual
evaluation protocols, which either exclude a substantial part of relevant
methods or ignore large parts of relevant failure sources. In this work, we
systematically reveal current pitfalls caused by these inconsistencies and
derive requirements for a holistic and realistic evaluation of failure
detection. To demonstrate the relevance of this unified perspective, we present
a large-scale empirical study for the first time enabling benchmarking
confidence scoring functions w.r.t. all relevant methods and failure sources.
The revelation of a simple softmax response baseline as the overall best
performing method underlines the drastic shortcomings of current evaluation in
the abundance of publicized research on confidence scoring. Code and trained
models are at https://github.com/IML-DKFZ/fd-shifts.
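The softmax response baseline referenced in the abstract is the maximum softmax probability of the classifier output. Below is a minimal PyTorch sketch of this confidence scoring function; the tensor shapes and the 0.5 rejection threshold are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def softmax_response(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability: one confidence score per sample."""
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values

# Illustrative usage: flag low-confidence predictions as potential failures.
logits = torch.randn(8, 10)            # 8 samples, 10 classes (made-up shapes)
confidence = softmax_response(logits)  # shape: (8,)
flagged = confidence < 0.5             # 0.5 is an arbitrary example threshold
```

In failure-detection benchmarks such scores are typically evaluated by ranking them against the classifier's correctness (e.g., via risk-coverage analysis) rather than at a single fixed threshold.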
Related papers
- A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
We derive novel metrics with high-probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We identify a general, widespread, but largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - Learning-Based Approaches to Predictive Monitoring with Conformal
Statistical Guarantees [2.1684857243537334]
This tutorial focuses on efficient methods for predictive monitoring (PM).
PM is the problem of detecting future violations of a given requirement from the current state of a system.
We present a general and comprehensive framework summarizing our approach to the predictive monitoring of CPSs.
arXiv Detail & Related papers (2023-12-04T15:16:42Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - Large Class Separation is not what you need for Relational
Reasoning-based OOD Detection [12.578844450586]
Out-Of-Distribution (OOD) detection methods provide a solution by identifying semantic novelty.
Most of these methods leverage a learning stage on the known data, which means training (or fine-tuning) a model to capture the concept of normality.
A viable alternative is that of evaluating similarities in the embedding space produced by large pre-trained models without any further learning effort.
arXiv Detail & Related papers (2023-07-12T14:10:15Z) - Robust Deep Learning for Autonomous Driving [0.0]
We introduce a new criterion to reliably estimate model confidence: the true class probability (TCP).
Since the true class is by essence unknown at test time, we propose to learn the TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context.
We tackle the challenge of jointly detecting misclassification and out-of-distribution samples by introducing a new uncertainty measure based on evidential models and defined on the simplex.
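For reference, the TCP of a sample is the softmax probability assigned to its ground-truth class, which is why it can only be computed directly on labeled data and must be regressed by an auxiliary model at test time. A minimal sketch of the quantity itself; the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def true_class_probability(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """TCP: softmax probability assigned to the ground-truth class."""
    probs = F.softmax(logits, dim=-1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

# On labeled data, TCP can serve as the regression target for an auxiliary
# confidence model; at test time only that auxiliary model is queried.
logits = torch.randn(8, 10)                           # made-up shapes
labels = torch.randint(0, 10, (8,))
tcp_target = true_class_probability(logits, labels)   # shape: (8,)
```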
arXiv Detail & Related papers (2022-11-14T22:07:11Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds the threshold.
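A minimal sketch of the ATC idea as summarized above: calibrate a threshold on source confidences so that the fraction above it matches source accuracy, then report the fraction of target confidences above that threshold. The use of max-softmax confidences and the variable names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def atc_estimate_accuracy(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence (sketch).

    source_conf:    per-sample confidences on labeled source data
    source_correct: 1.0 where the source prediction was correct, else 0.0
    target_conf:    per-sample confidences on unlabeled target data
    """
    source_acc = source_correct.mean()
    # Threshold chosen so the fraction of source confidences above it
    # matches the source accuracy.
    t = np.quantile(source_conf, 1.0 - source_acc)
    return (target_conf > t).mean()

# Illustrative usage with made-up confidences (e.g., max-softmax scores).
source_conf = np.random.rand(1000)
source_correct = (np.random.rand(1000) < 0.8).astype(float)
target_conf = np.random.rand(500)
estimated_target_acc = atc_estimate_accuracy(source_conf, source_correct, target_conf)
```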
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - An Effective Baseline for Robustness to Distributional Shift [5.627346969563955]
Refraining from confidently predicting when faced with categories of inputs different from those seen during training is an important requirement for the safe deployment of deep learning systems.
We present a simple, but highly effective approach to deal with out-of-distribution detection that uses the principle of abstention.
arXiv Detail & Related papers (2021-05-15T00:46:11Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep
Learning [70.72363097550483]
In this study, we focus on in-domain uncertainty for image classification.
To provide more insight, we introduce the deep ensemble equivalent score (DEE).
arXiv Detail & Related papers (2020-02-15T23:28:19Z)