Is this model reliable for everyone? Testing for strong calibration
- URL: http://arxiv.org/abs/2307.15247v1
- Date: Fri, 28 Jul 2023 00:59:14 GMT
- Title: Is this model reliable for everyone? Testing for strong calibration
- Authors: Jean Feng, Alexej Gossmann, Romain Pirracchio, Nicholas Petrick, Gene Pennello, Berkman Sahiner
- Abstract summary: In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup.
The task of auditing a model for strong calibration is well-known to be difficult due to the sheer number of potential subgroups.
Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal.
- Score: 4.893345190925178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a well-calibrated risk prediction model, the average predicted probability
is close to the true event rate for any given subgroup. Such models are
reliable across heterogeneous populations and satisfy strong notions of
algorithmic fairness. However, the task of auditing a model for strong
calibration is well-known to be difficult -- particularly for machine learning
(ML) algorithms -- due to the sheer number of potential subgroups. As such,
common practice is to only assess calibration with respect to a few predefined
subgroups. Recent developments in goodness-of-fit testing offer potential
solutions but are not designed for settings with weak signal or where the
poorly calibrated subgroup is small, as they either overly subdivide the data
or fail to divide the data at all. We introduce a new testing procedure based
on the following insight: if we can reorder observations by their expected
residuals, there should be a change in the association between the predicted
and observed residuals along this sequence if a poorly calibrated subgroup
exists. This lets us reframe the problem of calibration testing into one of
changepoint detection, for which powerful methods already exist. We begin by
introducing a sample-splitting procedure where a portion of the data is used to
train a suite of candidate models for predicting the residual, and the
remaining data are used to perform a score-based cumulative sum (CUSUM) test.
To further improve power, we then extend this adaptive CUSUM test to
incorporate cross-validation, while maintaining Type I error control under
minimal assumptions. Compared to existing methods, the proposed procedure
consistently achieved higher power in simulation studies and more than doubled
the power when auditing a mortality risk prediction model.
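To make the recipe concrete, here is a minimal sketch (in Python) of the sample-splitting version described above; it is not the authors' implementation. A gradient-boosted model of the residual stands in for the suite of candidate models, and a permutation null stands in for the paper's score-based CUSUM test with analytic Type I error control. All function and parameter names are illustrative.

```python
# Minimal sketch of the sample-splitting CUSUM audit described in the abstract.
# Assumptions: X is a NumPy feature matrix, y holds binary outcomes, and p_hat
# holds the audited model's predicted event probabilities. The permutation null
# below is a simple stand-in for the paper's score-based test.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def cusum_calibration_audit(X, y, p_hat, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    resid = y - p_hat                      # observed calibration residuals
    n = len(y)

    # 1) Sample splitting: one half learns where miscalibration might live.
    idx = rng.permutation(n)
    train, test = idx[: n // 2], idx[n // 2:]

    # 2) Candidate model for the expected residual E[y - p_hat(x) | x].
    g = GradientBoostingRegressor(max_depth=2, n_estimators=200)
    g.fit(X[train], resid[train])

    # 3) Reorder held-out observations by predicted residual, most suspect first.
    order = np.argsort(-g.predict(X[test]))
    r = resid[test][order]

    # 4) CUSUM scan: a large early partial sum suggests a poorly calibrated
    #    subgroup concentrated among the suspect cases.
    def max_cusum(v):
        partial = np.cumsum(v - v.mean())
        return np.abs(partial).max() / (v.std() * np.sqrt(len(v)) + 1e-12)

    stat = max_cusum(r)

    # 5) Permutation null: under good calibration, the ordering should carry
    #    no information about the observed residuals.
    null = np.array([max_cusum(rng.permutation(r)) for _ in range(n_perm)])
    p_value = (1 + (null >= stat).sum()) / (1 + n_perm)
    return stat, p_value
```

The paper's cross-validated extension reuses all of the data for both the residual-modelling and scanning steps while maintaining Type I error control; the single split above only keeps the illustration simple.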
Related papers
- Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting [55.17761802332469]
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample.
Prior methods perform backpropagation for each test sample, resulting in optimization costs that are prohibitive for many applications.
We propose an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples.
arXiv Detail & Related papers (2024-03-18T05:49:45Z) - Domain-adaptive and Subgroup-specific Cascaded Temperature Regression
for Out-of-distribution Calibration [16.930766717110053]
We propose a novel meta-set-based cascaded temperature regression method for post-hoc calibration.
We partition each meta-set into subgroups based on predicted category and confidence level, capturing diverse uncertainties.
A regression network is then trained to derive category-specific and confidence-level-specific scaling, achieving calibration across meta-sets.
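As a rough illustration of what category-specific and confidence-level-specific scaling means, the sketch below applies a separate temperature per (predicted category, confidence bin) cell. The temperatures themselves, which the paper obtains from a cascaded regression network trained across meta-sets, are simply taken as given here, and all names are illustrative.

```python
# Illustrative only: apply a pre-computed temperature per (predicted category,
# confidence bin) cell. The paper learns these scalings with a cascaded
# regression network over meta-sets; here temps is assumed to be given.
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def subgroup_temperature_scale(logits, temps, conf_edges):
    """temps[c, b]: assumed temperature for predicted class c and confidence bin b."""
    probs = softmax(logits)
    pred = probs.argmax(axis=1)           # predicted category
    conf = probs.max(axis=1)              # confidence level
    bins = np.digitize(conf, conf_edges)  # confidence bin per sample
    t = temps[pred, bins]                 # one temperature per sample
    return softmax(logits / t[:, None])   # recalibrated probabilities
```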
arXiv Detail & Related papers (2024-02-14T14:35:57Z) - Calibration tests beyond classification [30.616624345970973]
Most supervised machine learning tasks are subject to irreducible prediction errors.
Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets.
Calibrated models guarantee that the predictions are neither over- nor under-confident.
arXiv Detail & Related papers (2022-10-21T09:49:57Z) - Fair admission risk prediction with proportional multicalibration [0.16249424686052708]
Multicalibration constrains calibration error among flexibly-defined subpopulations.
It is possible for a decision-maker to learn to trust or distrust model predictions for specific groups.
We propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins.
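Under one plausible reading of "percent calibration error among groups and within prediction bins", the sketch below reports, for each (group, prediction bin) cell, the calibration error as a fraction of the observed event rate in that cell. This is an illustration of the general idea, not the paper's formal criterion, and all names and thresholds are hypothetical.

```python
# Hypothetical reading of "percent calibration error" per (group, prediction
# bin) cell: |mean prediction - observed rate| relative to the observed rate.
# Not the paper's formal definition.
import numpy as np


def percent_calibration_errors(y, p_hat, group, bin_edges):
    errors = {}
    bins = np.digitize(p_hat, bin_edges)
    for g in np.unique(group):
        for b in np.unique(bins):
            cell = (group == g) & (bins == b)
            if not cell.any():
                continue
            obs = y[cell].mean()       # observed event rate in the cell
            pred = p_hat[cell].mean()  # average predicted risk in the cell
            errors[(g, b)] = abs(pred - obs) / max(obs, 1e-12)
    return errors  # the criterion would require every value to stay below some alpha
```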
arXiv Detail & Related papers (2022-09-29T08:15:29Z) - T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a de-biased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE).
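For context, the quantity at stake can be illustrated with the common binned plug-in estimate of the (binary, $\ell_1$) expected calibration error below; T-Cal works with an $\ell_2$ version and builds its test on a de-biased estimator, neither of which this sketch attempts to reproduce. Names are illustrative.

```python
# Plain binned plug-in estimate of the binary expected calibration error.
# T-Cal's test statistic uses a de-biased variant, not shown here.
import numpy as np


def binned_ece(y, p_hat, n_bins=15):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        cell = bins == b
        if cell.any():
            # |average predicted probability - observed frequency|, weighted by bin mass
            ece += cell.mean() * abs(p_hat[cell].mean() - y[cell].mean())
    return ece
```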
arXiv Detail & Related papers (2022-03-03T16:58:54Z) - Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Individual Calibration with Randomized Forecasting [116.2086707626651]
We show that calibration for individual samples is possible in the regression setup if the predictions are randomized.
We design a training objective to enforce individual calibration and use it to train randomized regression functions.
arXiv Detail & Related papers (2020-06-18T05:53:10Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
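To see why pooling helps at low prevalence, here is a small worked example of Dorfman's classic noise-free two-stage scheme (for intuition only; the paper itself targets the noisy, adaptive setting): with prevalence p and group size k, one pooled test is always run, and k follow-up tests are run only if the pool is positive, so the expected number of tests per person is 1/k + 1 - (1 - p)^k.

```python
# Expected tests per person under Dorfman's noise-free two-stage scheme:
# one pooled test per group of k, plus k individual tests when the pool is positive.
def dorfman_tests_per_person(p, k):
    return 1.0 / k + 1.0 - (1.0 - p) ** k


# Example: at 1% prevalence with groups of 10, about 0.20 tests per person,
# roughly a five-fold saving over testing everyone individually.
print(dorfman_tests_per_person(0.01, 10))  # ~0.196
```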
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.