Evaluation: from precision, recall and F-measure to ROC, informedness,
markedness and correlation
- URL: http://arxiv.org/abs/2010.16061v1
- Date: Sun, 11 Oct 2020 02:15:11 GMT
- Title: Evaluation: from precision, recall and F-measure to ROC, informedness,
markedness and correlation
- Authors: David M. W. Powers
- Abstract summary: Measures such as Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases.
We discuss several concepts and measures that reflect the probability that prediction is informed versus chance.
- Score: 3.7819322027528113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonly used evaluation measures including Recall, Precision, F-Measure and
Rand Accuracy are biased and should not be used without clear understanding of
the biases, and corresponding identification of chance or base case levels of
the statistic. Using these measures, a system that performs worse in the
objective sense of Informedness can appear to perform better under any of
these commonly used measures. We discuss several concepts and measures that
reflect the probability that prediction is informed versus chance (Informedness),
and introduce Markedness as a dual measure for the probability that prediction
is marked versus chance. Finally we demonstrate elegant connections between the
concepts of Informedness, Markedness, Correlation and Significance as well as
their intuitive relationships with Recall and Precision, and outline the
extension from the dichotomous case to the general multi-class case.
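As a concrete illustration of these relationships (a minimal sketch, not code from the paper; the contingency counts are hypothetical), the following computes Recall, Precision, Informedness, Markedness and the Matthews correlation for the dichotomous case. The correlation is the geometric mean of Informedness and Markedness, which is one of the connections the paper develops.

```python
from math import sqrt

def dichotomous_measures(tp, fp, fn, tn):
    """Evaluation measures for a 2x2 contingency table (dichotomous case)."""
    recall = tp / (tp + fn)             # true positive rate (sensitivity)
    inverse_recall = tn / (tn + fp)     # true negative rate (specificity)
    precision = tp / (tp + fp)          # positive predictive value
    inverse_precision = tn / (tn + fn)  # negative predictive value

    informedness = recall + inverse_recall - 1       # Youden's J: prediction informed vs. chance
    markedness = precision + inverse_precision - 1   # dual measure: prediction marked vs. chance

    # Matthews correlation; note that mcc**2 == informedness * markedness
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"recall": recall, "precision": precision,
            "informedness": informedness, "markedness": markedness, "correlation": mcc}

# Hypothetical counts: 90 true positives, 30 false positives, 10 false negatives, 70 true negatives
print(dichotomous_measures(tp=90, fp=30, fn=10, tn=70))
# informedness = 0.6, markedness = 0.625, correlation ≈ 0.612 = sqrt(0.6 * 0.625)
```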
Related papers
- Confidence Intervals for Evaluation of Data Mining [3.8485822412233452]
We consider statistical inference about general performance measures used in data mining.
We study the finite sample coverage probabilities for confidence intervals.
We also propose a 'blurring correction' on the variance to improve the finite sample performance.
arXiv Detail & Related papers (2025-02-10T20:22:02Z)
- Uncertainty Quantification in Stereo Matching [61.73532883992135]
We propose a new framework for stereo matching and its uncertainty quantification.
We adopt Bayes risk as a measure of uncertainty and estimate data and model uncertainty separately.
We apply our uncertainty method to improve prediction accuracy by selecting data points with small uncertainties.
arXiv Detail & Related papers (2024-12-24T23:28:20Z)
- On Information-Theoretic Measures of Predictive Uncertainty [5.8034373350518775]
Despite its significance, a consensus on the correct measurement of predictive uncertainty remains elusive.
Our proposed framework categorizes predictive uncertainty measures according to two factors: (I) the predicting model, and (II) the approximation of the true predictive distribution.
We empirically evaluate these measures in typical uncertainty estimation settings, such as misclassification detection, selective prediction, and out-of-distribution detection.
arXiv Detail & Related papers (2024-10-14T17:52:18Z)
- Beyond Calibration: Assessing the Probabilistic Fit of Neural Regressors via Conditional Congruence [2.2359781747539396]
Deep networks often suffer from overconfidence and misaligned predictive distributions.
We introduce a metric, Conditional Congruence Error (CCE), that uses conditional kernel mean embeddings to estimate the distance between the learned predictive distribution and the empirical, conditional distribution in a dataset.
We show that using CCE to measure congruence 1) accurately quantifies misalignment between distributions when the data-generating process is known, 2) effectively scales to real-world, high-dimensional image regression tasks, and 3) can be used to gauge model reliability on unseen instances.
arXiv Detail & Related papers (2024-05-20T23:30:07Z)
- Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We find a general, widespread but largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z)
- Standardized Interpretable Fairness Measures for Continuous Risk Scores [4.192037827105842]
We propose a standardized version of fairness measures for continuous scores with a reasonable interpretation based on the Wasserstein distance.
Our measures are easily computable and well suited for quantifying and interpreting the strength of group disparities as well as for comparing biases across different models, datasets, or time points.
arXiv Detail & Related papers (2023-08-22T12:01:49Z)
- Evaluating Probabilistic Classifiers: The Triptych [62.997667081978825]
We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance.
The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value (a minimal sketch of the first two appears after this list).
arXiv Detail & Related papers (2023-01-25T19:35:23Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- Uncertainty Estimation for Heatmap-based Landmark Localization [4.673063715963989]
We propose Quantile Binning, a data-driven method to categorise predictions by uncertainty with estimated error bounds.
We demonstrate this framework by comparing and contrasting three uncertainty measures.
We conclude by illustrating how filtering out gross mispredictions caught in our Quantile Bins significantly improves the proportion of predictions under an acceptable error threshold.
arXiv Detail & Related papers (2022-03-04T14:40:44Z)
- DEUP: Direct Epistemic Uncertainty Prediction [56.087230230128185]
Epistemic uncertainty is the part of out-of-sample prediction error that is due to the learner's lack of knowledge.
We propose a principled approach for directly estimating epistemic uncertainty by learning to predict generalization error and subtracting an estimate of aleatoric uncertainty.
arXiv Detail & Related papers (2021-02-16T23:50:35Z)
- Learning Accurate Dense Correspondences and When to Trust Them [161.76275845530964]
We aim to estimate a dense flow field relating two images, coupled with a robust pixel-wise confidence map.
We develop a flexible probabilistic approach that jointly learns the flow prediction and its uncertainty.
Our approach obtains state-of-the-art results on challenging geometric matching and optical flow datasets.
arXiv Detail & Related papers (2021-01-05T18:54:11Z)
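The triptych entry above separates calibration (reliability diagram) from discrimination (ROC curve). As a minimal, hedged sketch only, not code from that paper, with hypothetical function names and no tie handling, the two diagnostics can be computed from scores and binary labels roughly as follows:

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping a threshold from high to low scores."""
    order = np.argsort(-np.asarray(scores))   # highest score first
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)                         # positives predicted positive at each cut
    fp = np.cumsum(1 - y)                     # negatives predicted positive at each cut
    return fp / (len(y) - y.sum()), tp / y.sum()

def reliability_bins(probs, labels, n_bins=10):
    """(mean predicted probability, empirical frequency) per equal-width probability bin."""
    p, y = np.asarray(probs), np.asarray(labels)
    b = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return [(p[b == k].mean(), y[b == k].mean()) for k in range(n_bins) if np.any(b == k)]
```

Plotting the roc_points output traces the ROC curve; plotting reliability_bins against the diagonal gives the reliability diagram.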