Stop Measuring Calibration When Humans Disagree
- URL: http://arxiv.org/abs/2210.16133v1
- Date: Fri, 28 Oct 2022 14:01:32 GMT
- Title: Stop Measuring Calibration When Humans Disagree
- Authors: Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernandez
- Abstract summary: We show that measuring calibration to human majority given inherent disagreements is theoretically problematic.
We derive several instance-level measures of calibration that capture key statistical properties of human judgements.
- Score: 25.177984280183402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Calibration is a popular framework to evaluate whether a classifier knows
when it does not know - i.e., its predictive probabilities are a good
indication of how likely a prediction is to be correct. Correctness is commonly
estimated against the human majority class. Recently, calibration to human
majority has been measured on tasks where humans inherently disagree about
which class applies. We show that measuring calibration to human majority given
inherent disagreements is theoretically problematic, demonstrate this
empirically on the ChaosNLI dataset, and derive several instance-level measures
of calibration that capture key statistical properties of human judgements -
class frequency, ranking and entropy.
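To make the instance-level view concrete, the sketch below compares a classifier's predicted distribution with the human judgement distribution for a single item in terms of class frequency, class ranking, and entropy, the three statistical properties named in the abstract. The function and the exact definitions are illustrative assumptions, not the measures derived in the paper.

```python
import numpy as np

def shannon_entropy(p):
    """Entropy of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def instance_level_stats(pred_probs, human_votes):
    """Compare one instance's predicted distribution with its human judgement
    distribution. Illustrative definitions, not the paper's exact measures."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    human_dist = np.asarray(human_votes, dtype=float)
    human_dist = human_dist / human_dist.sum()   # observed class frequencies
    return {
        # gap between the predicted probability of the human-majority class
        # and that class's observed human frequency
        "majority_gap": float(abs(pred_probs[human_dist.argmax()] - human_dist.max())),
        # does the model rank the classes in the same order as the annotators?
        "ranking_match": bool((np.argsort(-pred_probs) == np.argsort(-human_dist)).all()),
        # does the model's uncertainty (entropy) track human disagreement?
        "entropy_gap": float(abs(shannon_entropy(pred_probs) - shannon_entropy(human_dist))),
    }

# Example: an NLI item where 100 annotators split 60/30/10 over
# entailment / neutral / contradiction, as in ChaosNLI-style data.
print(instance_level_stats([0.7, 0.2, 0.1], [60, 30, 10]))
```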
Related papers
- Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We show that calibration error and refinement error are not minimized simultaneously during training.
We introduce a new metric for early stopping and hyperparameter tuning that makes it possible to minimize refinement error during training.
Our method integrates seamlessly with any architecture and consistently improves performance across diverse classification tasks.
arXiv Detail & Related papers (2025-01-31T15:03:54Z) - Calibration through the Lens of Interpretability [3.9962751777898955]
Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy.
In this work, we initiate an axiomatic study of the notion of calibration.
We catalogue desirable properties of calibrated models as well as corresponding evaluation metrics and analyze their feasibility and correspondences.
arXiv Detail & Related papers (2024-12-01T19:28:16Z) - Truthfulness of Calibration Measures [18.21682539787221]
A calibration measure is said to be truthful if the forecaster minimizes expected penalty by predicting the conditional expectation of the next outcome.
This makes truthfulness an essential desideratum for calibration measures, alongside typical requirements such as soundness and completeness.
We introduce a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), under which truthful prediction is optimal up to a constant multiplicative factor.
arXiv Detail & Related papers (2024-07-19T02:07:55Z) - Orthogonal Causal Calibration [55.28164682911196]
We prove generic upper bounds on the calibration error of any causal parameter estimate $\theta$ with respect to any loss $\ell$.
We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration.
arXiv Detail & Related papers (2024-06-04T03:35:25Z) - Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no-regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z) - T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE).
arXiv Detail & Related papers (2022-03-03T16:58:54Z) - Estimating Expected Calibration Errors [1.52292571922932]
Uncertainty in probabilistic predictions is a key concern when models are used to support human decision making.
Most models are not intrinsically well calibrated, meaning that their decision scores are not consistent with posterior probabilities.
We build an empirical procedure to quantify the quality of ECE estimators, and use it to decide which estimator should be used in practice for different settings (a generic binned plug-in estimator is sketched after this list).
arXiv Detail & Related papers (2021-09-08T08:00:23Z) - Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z) - Individual Calibration with Randomized Forecasting [116.2086707626651]
We show that calibration for individual samples is possible in the regression setup if the predictions are randomized.
We design a training objective to enforce individual calibration and use it to train randomized regression functions.
arXiv Detail & Related papers (2020-06-18T05:53:10Z)
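Several entries above center on estimating the Expected Calibration Error from a finite sample (the debiased plug-in $\ell_2$-ECE behind T-Cal, and the comparison of ECE estimators). The sketch below is a generic plug-in binned estimator with an equal-width or equal-mass binning switch; the function name, defaults, and binning choices are illustrative assumptions, and it is not the exact estimator of any paper in the list.

```python
import numpy as np

def binned_ece(confidences, correct, n_bins=15, p=1, equal_mass=False):
    """Plug-in binned estimate of the l_p expected calibration error.
    A generic sketch; not the exact estimator of any paper listed above."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if equal_mass:
        # equal-mass bins: each bin holds roughly the same number of samples
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        # equal-width bins on [0, 1]
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # the last bin is closed on the right so confidence == hi is included
        mask = (confidences >= lo) & ((confidences <= hi) if i == n_bins - 1
                                      else (confidences < hi))
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap ** p
    return ece ** (1.0 / p)

# Example: top-label confidences and whether each prediction was correct.
conf = [0.92, 0.81, 0.67, 0.95, 0.58, 0.73]
hit = [1, 1, 0, 1, 1, 0]
print(binned_ece(conf, hit, n_bins=3))        # equal-width bins, l_1 ECE
print(binned_ece(conf, hit, n_bins=3, p=2))   # l_2 variant
```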
This list is automatically generated from the titles and abstracts of the papers on this site.