Related papers: Probabilistic Scores of Classifiers, Calibration is not Enough

Probabilistic Scores of Classifiers, Calibration is not Enough

URL: http://arxiv.org/abs/2408.03421v1
Date: Tue, 6 Aug 2024 19:53:00 GMT
Title: Probabilistic Scores of Classifiers, Calibration is not Enough
Authors: Agathe Fernandes Machado, Arthur Charpentier, Emmanuel Flachaire, Ewen Gallic, François Hu,
Abstract summary: In binary classification tasks, accurate representation of probabilistic predictions is essential for various real-world applications. In this study, we highlight approaches that prioritize the alignment between predicted scores and true probability distributions. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.
Score: 0.32985979395737786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In binary classification tasks, accurate representation of probabilistic predictions is essential for various real-world applications such as predicting payment defaults or assessing medical risks. The model must then be well-calibrated to ensure alignment between predicted probabilities and actual outcomes. However, when score heterogeneity deviates from the underlying data probability distribution, traditional calibration metrics lose reliability, failing to align score distribution with actual probabilities. In this study, we highlight approaches that prioritize optimizing the alignment between predicted scores and true probability distributions over minimizing traditional performance or calibration metrics. When employing tree-based models such as Random Forest and XGBoost, our analysis emphasizes the flexibility these models offer in tuning hyperparameters to minimize the Kullback-Leibler (KL) divergence between predicted and true distributions. Through extensive empirical analysis across 10 UCI datasets and simulations, we demonstrate that optimizing tree-based models based on KL divergence yields superior alignment between predicted scores and actual probabilities without significant performance loss. In real-world scenarios, the reference probability is determined a priori as a Beta distribution estimated through maximum likelihood. Conversely, minimizing traditional calibration metrics may lead to suboptimal results, characterized by notable performance declines and inferior KL values. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.

Related papers

Enforcing tail calibration when training probabilistic forecast models [0.0]
We study how the loss function used to train probabilistic forecast models can be adapted to improve the reliability of forecasts made for extreme events.<n>We demonstrate that state-of-the-art models do not issue calibrated forecasts for extreme wind speeds, and that the calibration of forecasts for extreme events can be improved by suitable adaptations to the loss function during model training.
arXiv Detail & Related papers (2025-06-16T16:51:06Z)
Deep Probability Segmentation: Are segmentation models probability estimators? [0.7646713951724011]
We apply Calibrated Probability Estimation to segmentation tasks to evaluate its impact on model calibration. Results indicate that while CaPE improves calibration, its effect is less pronounced compared to classification tasks. We also investigated the influence of dataset size and bin optimization on the effectiveness of calibration.
arXiv Detail & Related papers (2024-09-19T07:52:19Z)
Beyond Calibration: Assessing the Probabilistic Fit of Neural Regressors via Conditional Congruence [2.2359781747539396]
Deep networks often suffer from overconfidence and misaligned predictive distributions. We introduce a metric, Conditional Congruence Error (CCE), that uses conditional kernel mean embeddings to estimate the distance between the learned predictive distribution and the empirical, conditional distribution in a dataset. We show that using to measure congruence 1) accurately quantifies misalignment between distributions when the data generating process is known, 2) effectively scales to real-world, high dimensional image regression tasks, and 3) can be used to gauge model reliability on unseen instances.
arXiv Detail & Related papers (2024-05-20T23:30:07Z)
Calibrating Neural Simulation-Based Inference with Differentiable Coverage Probability [50.44439018155837]
We propose to include a calibration term directly into the training objective of the neural model. By introducing a relaxation of the classical formulation of calibration error we enable end-to-end backpropagation. It is directly applicable to existing computational pipelines allowing reliable black-box posterior inference.
arXiv Detail & Related papers (2023-10-20T10:20:45Z)
When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting [69.30930115236228]
Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecasts distributions. We propose PROFHiT, a fully probabilistic hierarchical forecasting model that jointly models forecast distribution of entire hierarchy.
arXiv Detail & Related papers (2023-10-17T20:30:16Z)
Variational Inference with Coverage Guarantees in Simulation-Based Inference [18.818573945984873]
We propose Conformalized Amortized Neural Variational Inference (CANVI) CANVI constructs conformalized predictors based on each candidate, compares the predictors using a metric known as predictive efficiency, and returns the most efficient predictor. We prove lower bounds on the predictive efficiency of the regions produced by CANVI and explore how the quality of a posterior approximation relates to the predictive efficiency of prediction regions based on that approximation.
arXiv Detail & Related papers (2023-05-23T17:24:04Z)
Evaluating Probabilistic Classifiers: The Triptych [62.997667081978825]
We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance. The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value.
arXiv Detail & Related papers (2023-01-25T19:35:23Z)
Calibrated Selective Classification [34.08454890436067]
We develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties. We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
arXiv Detail & Related papers (2022-08-25T13:31:09Z)
When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting [69.30930115236228]
Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecasts distributions. We propose PROFHiT, a fully probabilistic hierarchical forecasting model that jointly models forecast distribution of entire hierarchy.
arXiv Detail & Related papers (2022-06-16T06:13:53Z)
Evaluating probabilistic classifiers: Reliability diagrams and score decompositions revisited [68.8204255655161]
We introduce the CORP approach, which generates provably statistically Consistent, Optimally binned, and Reproducible reliability diagrams in an automated way. Corpor is based on non-parametric isotonic regression and implemented via the Pool-adjacent-violators (PAV) algorithm.
arXiv Detail & Related papers (2020-08-07T08:22:26Z)
Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation. We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.