T-Cal: An optimal test for the calibration of predictive models
- URL: http://arxiv.org/abs/2203.01850v4
- Date: Tue, 5 Dec 2023 23:28:57 GMT
- Title: T-Cal: An optimal test for the calibration of predictive models
- Authors: Donghwan Lee, Xinmeng Huang, Hamed Hassani, Edgar Dobriban
- Abstract summary: We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE).
- Score: 49.11538724574202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prediction accuracy of machine learning methods is steadily increasing,
but the calibration of their uncertainty predictions poses a significant
challenge. Numerous works focus on obtaining well-calibrated predictive models,
but less is known about reliably assessing model calibration. This limits our
ability to know when algorithms for improving calibration have a real effect,
and when their improvements are merely artifacts due to random noise in finite
datasets. In this work, we consider detecting mis-calibration of predictive
models using a finite validation dataset as a hypothesis testing problem. The
null hypothesis is that the predictive model is calibrated, while the
alternative hypothesis is that the deviation from calibration is sufficiently
large.
We find that detecting mis-calibration is only possible when the conditional
probabilities of the classes are sufficiently smooth functions of the
predictions. When the conditional class probabilities are Hölder continuous,
we propose T-Cal, a minimax optimal test for calibration based on a debiased
plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further
propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We
verify our theoretical findings with a broad range of experiments, including
with several popular deep neural net architectures and several standard
post-hoc calibration methods. T-Cal is a practical general-purpose tool, which
-- combined with classical tests for discrete-valued predictors -- can be used
to test the calibration of virtually any probabilistic classification method.
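In symbols, a plausible formalization of this testing problem (the notation here is a reading of the abstract; the paper's exact definitions of the smoothness class and the separation may differ) is
$$H_0:\ \mathbb{E}[Y \mid f(X)] = f(X)\ \text{almost surely} \qquad \text{vs.} \qquad H_1:\ \mathrm{ECE}_2(f) := \Big(\mathbb{E}\,\big\|\mathbb{E}[Y \mid f(X)] - f(X)\big\|_2^2\Big)^{1/2} \ge \varepsilon,$$
where $f(X)$ is the predicted class-probability vector, $Y$ is the one-hot encoded label, and $\varepsilon > 0$ is the separation level.
The sketch below illustrates the flavor of such a test for binary classification: a binned plug-in estimate of the squared $\ell_2$-ECE with a within-bin variance correction (debiasing), plus a Monte Carlo p-value obtained by simulating labels under the null of calibration. The function names, the equal-width binning, and the simulation-based threshold are illustrative assumptions; this is not the exact T-Cal statistic nor its theoretically derived critical values.
```python
import numpy as np

def debiased_binned_sq_ece(probs, labels, n_bins=15):
    """Debiased binned estimate of the squared l2-ECE for binary labels.

    probs: predicted probabilities of the positive class, shape (n,)
    labels: binary labels in {0, 1}, shape (n,)
    Generic construction: binned plug-in estimate minus a within-bin
    variance correction; not necessarily the exact T-Cal estimator.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(probs)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    est = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        n_b = int(mask.sum())
        if n_b < 2:
            continue  # too few points in this bin to form the correction
        gap = labels[mask] - probs[mask]
        plug_in = gap.mean() ** 2
        # gap.var(ddof=1)/n_b estimates Var(mean gap), i.e. the upward bias of plug_in
        correction = gap.var(ddof=1) / n_b
        est += (n_b / n) * (plug_in - correction)
    return est

def calibration_test_pvalue(probs, labels, n_bins=15, n_sim=2000, seed=0):
    """Monte Carlo p-value under H0: labels ~ Bernoulli(probs), independently."""
    rng = np.random.default_rng(seed)
    observed = debiased_binned_sq_ece(probs, labels, n_bins)
    null_stats = np.array([
        debiased_binned_sq_ece(probs, rng.binomial(1, probs), n_bins)
        for _ in range(n_sim)
    ])
    return (1 + np.sum(null_stats >= observed)) / (n_sim + 1)
```
Rejecting $H_0$ when the p-value falls below the chosen level flags mis-calibration. Roughly speaking, T-Cal itself chooses the binning resolution as a function of the assumed Hölder smoothness to attain the minimax detection rate, and Adaptive T-Cal combines tests across multiple resolutions when the smoothness is unknown.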
Related papers
- Optimizing Calibration by Gaining Aware of Prediction Correctness [30.619608580138802]
Cross-Entropy (CE) loss is widely used for calibrator training; it pushes the model to increase its confidence in the ground-truth class.
We propose a new post-hoc calibration objective derived from the aim of calibration.
arXiv Detail & Related papers (2024-04-19T17:25:43Z) - Calibrated Uncertainty Quantification for Operator Learning via Conformal Prediction [95.75771195913046]
We propose a risk-controlling quantile neural operator, a distribution-free, finite-sample functional calibration conformal prediction method.
We provide a theoretical calibration guarantee on the coverage rate, defined as the expected percentage of points on the function domain that are covered by the predicted uncertainty bands.
Empirical results on a 2D Darcy flow and a 3D car surface pressure prediction task validate our theoretical results.
arXiv Detail & Related papers (2024-02-02T23:43:28Z) - Sharp Calibrated Gaussian Processes [58.94710279601622]
State-of-the-art approaches for designing calibrated models rely on inflating the Gaussian process posterior variance.
We present a calibration approach that generates predictive quantiles using a computation inspired by the vanilla Gaussian process posterior variance.
Our approach is shown to yield a calibrated model under reasonable assumptions.
arXiv Detail & Related papers (2023-02-23T12:17:36Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z) - A Consistent and Differentiable Lp Canonical Calibration Error Estimator [21.67616079217758]
Deep neural networks are poorly calibrated and tend to output overconfident predictions.
We propose a low-bias, trainable calibration error estimator based on Dirichlet kernel density estimates.
Our method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities.
arXiv Detail & Related papers (2022-10-13T15:11:11Z) - Calibrated Selective Classification [34.08454890436067]
We develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties.
We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model.
We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
arXiv Detail & Related papers (2022-08-25T13:31:09Z) - Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression.
This framework allows one to transform any regression model into a calibrated probabilistic model.
We conduct an empirical study of MCC on 17 regression datasets.
arXiv Detail & Related papers (2022-06-23T03:25:23Z) - Posterior Probability Matters: Doubly-Adaptive Calibration for Neural Predictions in Online Advertising [29.80454356173723]
Field-level calibration is fine-grained and more practical.
AdaCalib learns an isotonic function family to calibrate model predictions (a generic isotonic-recalibration sketch appears after this list).
Experiments verify that AdaCalib achieves significant improvement on calibration performance.
arXiv Detail & Related papers (2022-05-15T14:27:19Z) - Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z)
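Several of the related papers above concern post-hoc recalibration; as a point of reference for the isotonic function family mentioned in the AdaCalib entry, here is a minimal sketch of plain isotonic-regression recalibration on a held-out validation set using scikit-learn. The scores and labels are made-up placeholders, and this is generic isotonic calibration, not AdaCalib's doubly-adaptive, field-level procedure.
```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out validation scores and binary labels.
val_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.65, 0.2, 0.75])
val_labels = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Fit a monotone (isotonic) mapping from raw scores to calibrated probabilities.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(val_scores, val_labels)

# Apply the learned mapping to new model outputs.
test_scores = np.array([0.3, 0.6, 0.95])
calibrated = iso.predict(test_scores)
print(calibrated)
```
In practice the mapping would be fit on validation data disjoint from both training and the final assessment, so that a test such as T-Cal can then be applied to the recalibrated predictions on fresh data.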