Making and Evaluating Calibrated Forecasts
- URL: http://arxiv.org/abs/2510.06388v1
- Date: Tue, 07 Oct 2025 19:11:03 GMT
- Title: Making and Evaluating Calibrated Forecasts
- Authors: Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu
- Abstract summary: We introduce a perfectly truthful calibration measure for multi-class prediction tasks. We mathematically prove and empirically verify that our calibration measure exhibits superior robustness. This result addresses the non-robustness issue of binned ECE.
- Score: 10.153382419318023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.
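The bin-size sensitivity mentioned in the abstract is easy to reproduce. The sketch below is illustrative only (it is binned ECE, not the truthful measure proposed in the paper): it scores a synthetically calibrated multi-class predictor with top-label binned ECE and shows how the reported value shifts with the number of bins.

```python
# Illustrative sketch: top-label binned ECE for a multi-class predictor whose
# outcomes are drawn from its own predicted distributions (i.e., a perfectly
# calibrated predictor). The score still varies with the bin-count hyperparameter.
import numpy as np

def binned_ece(confidences, correct, n_bins):
    """Equal-width binned ECE on top-label confidence."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=5000)            # predicted class distributions
labels = np.array([rng.choice(3, p=p) for p in probs])  # outcomes drawn from those probs
conf = probs.max(axis=1)
correct = (probs.argmax(axis=1) == labels).astype(float)

for n_bins in (5, 10, 15, 30):
    print(n_bins, round(binned_ece(conf, correct, n_bins), 4))
```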
Related papers
- Uncertainty-Aware Post-Hoc Calibration: Mitigating Confidently Incorrect Predictions Beyond Calibration Metrics [6.9681910774977815]
This paper presents a post-hoc calibration framework to enhance calibration quality and uncertainty-aware decision-making. A comprehensive evaluation is conducted using calibration metrics, uncertainty-aware performance measures, and empirical conformal coverage. Experiments show that the proposed method achieves fewer confidently incorrect predictions and a competitive Expected Calibration Error compared with isotonic and focal-loss baselines.
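For context, here is a minimal sketch of the isotonic-regression baseline mentioned above, fit on hypothetical held-out scores with scikit-learn; the paper's own uncertainty-aware framework is not reproduced here.

```python
# Minimal sketch of post-hoc calibration with isotonic regression (a standard
# baseline), on synthetic overconfident binary scores. Assumes a held-out
# calibration split; names and data here are hypothetical.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, size=2000)
labels = (rng.uniform(size=2000) < true_p).astype(int)
raw_scores = np.clip(true_p + 0.25 * (true_p - 0.5), 0, 1)  # systematically overconfident

# Fit a monotone map from raw scores to empirical outcome frequencies.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, labels)

# Apply the fitted map to scores at prediction time.
calibrated = iso.predict(raw_scores)
```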
arXiv Detail & Related papers (2025-10-19T23:55:36Z)
- A Perfectly Truthful Calibration Measure [14.052397440160568]
We design a perfectly truthful calibration measure in the batch setting: the averaged two-bin calibration error (ATB). ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal).
arXiv Detail & Related papers (2025-08-18T17:09:34Z)
- Measuring Informativeness Gap of (Mis)Calibrated Predictors [15.651406777700517]
In many applications, decision-makers must choose between multiple predictive models that may all be miscalibrated. Our framework strictly generalizes U-Calibration [KLST-23] and Decision Loss [HW-24], which compare a miscalibrated predictor to its calibrated counterpart. Our second contribution is a dual characterization of the informativeness gap, which gives rise to a natural informativeness measure.
arXiv Detail & Related papers (2025-07-16T10:01:22Z)
- Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We present a novel variational formulation of the calibration-refinement decomposition. We provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training.
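For intuition, the classical binned calibration-refinement decomposition of the Brier score can be computed directly. The sketch below shows that classical decomposition on toy data; it is not the paper's variational formulation.

```python
# Classical calibration-refinement decomposition of the Brier score, grouping
# examples by their (discrete) predicted value. The two terms sum back to the
# Brier score exactly for such predictions.
import numpy as np

def calibration_refinement(preds, labels):
    brier = np.mean((preds - labels) ** 2)
    cal, ref = 0.0, 0.0
    for f in np.unique(preds):
        mask = preds == f
        w = mask.mean()
        o = labels[mask].mean()          # empirical frequency in this group
        cal += w * (f - o) ** 2          # calibration (reliability) term
        ref += w * o * (1.0 - o)         # refinement (sharpness) term
    return brier, cal, ref

preds = np.array([0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8, 0.8])
labels = np.array([0,   1,   0,   1,   1,   0,   1,   1])
brier, cal, ref = calibration_refinement(preds, labels)
print(brier, cal + ref)   # identical: the decomposition is exact
```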
arXiv Detail & Related papers (2025-01-31T15:03:54Z)
- Truthfulness of Calibration Measures [18.21682539787221]
A calibration measure is said to be truthful if the forecaster minimizes expected penalty by predicting the conditional expectation of the next outcome.
This makes it an essential desideratum for calibration measures, alongside typical requirements, such as soundness and completeness.
We introduce a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), under which truthful prediction is optimal up to a constant multiplicative factor.
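The incentive problem can be seen with binned ECE itself: a predictor that flattens true probabilities of 0.49 and 0.51 to a constant 0.5 merges two bins, averages away sampling noise, and is typically rewarded with a lower score. The simulation below (reusing a binned ECE like the one sketched earlier) is a toy illustration of this non-truthfulness, not the SSCE construction.

```python
# Toy simulation of the non-truthfulness of binned ECE: reporting the true
# probabilities scores worse on average than reporting a flattened constant.
import numpy as np

def binned_ece(preds, outcomes, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(preds, edges, right=True) - 1, 0, n_bins - 1)
    return sum((idx == b).mean() * abs(preds[idx == b].mean() - outcomes[idx == b].mean())
               for b in range(n_bins) if (idx == b).any())

rng = np.random.default_rng(0)
n, trials = 2000, 500
true_p = np.tile([0.49, 0.51], n // 2)            # true probabilities straddle a bin edge
truthful, lying = [], []
for _ in range(trials):
    y = (rng.uniform(size=n) < true_p).astype(float)
    truthful.append(binned_ece(true_p, y))        # report the true probabilities
    lying.append(binned_ece(np.full(n, 0.5), y))  # report a flattened constant
print(np.mean(truthful), np.mean(lying))          # the lying predictor looks more calibrated
```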
arXiv Detail & Related papers (2024-07-19T02:07:55Z)
- Towards Certification of Uncertainty Calibration under Adversarial Attacks [96.48317453951418]
We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations. We propose novel calibration attacks and demonstrate how they can improve model calibration through adversarial calibration training.
arXiv Detail & Related papers (2024-05-22T18:52:09Z)
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no regret decisions.
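As a rough illustration of a differentiable, kernel-based calibration objective, the sketch below implements a penalty in the spirit of MMCE (maximum mean calibration error); it is an assumption-laden stand-in, not necessarily the metric proposed in this paper.

```python
# Kernel-style calibration penalty (in the spirit of MMCE) that can be added to a
# training loss. Pairwise Laplacian kernel over top-label confidences, weighted by
# per-sample calibration residuals; differentiable w.r.t. the logits.
import torch

def kernel_calibration_penalty(logits, labels, bandwidth=0.4):
    probs = torch.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)                  # top-label confidence and prediction
    err = (pred == labels).float() - conf          # per-sample calibration residual
    k = torch.exp(-torch.abs(conf[:, None] - conf[None, :]) / bandwidth)
    return (err[:, None] * err[None, :] * k).mean()

# Hypothetical usage inside a training step: cross-entropy plus calibration penalty.
logits = torch.randn(64, 10, requires_grad=True)
labels = torch.randint(0, 10, (64,))
loss = torch.nn.functional.cross_entropy(logits, labels) \
       + 0.5 * kernel_calibration_penalty(logits, labels)
loss.backward()
```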
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
- Boldness-Recalibration for Binary Event Predictions [0.0]
Ideally, probability predictions are (i) well calibrated, (ii) accurate, and (iii) bold, i.e., spread out enough to be informative for decision making.
There is a fundamental tension between calibration and boldness, since calibration metrics can be high when predictions are overly cautious, i.e., non-bold.
The purpose of this work is to develop a Bayesian model selection-based approach to assess calibration, and a strategy for boldness-recalibration.
arXiv Detail & Related papers (2023-05-05T18:14:47Z)
- T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE).
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves the localized calibration error (LCE) more effectively than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
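Below is a minimal sketch of the general importance-sampling idea, assuming weights are estimated with a logistic-regression domain classifier; the paper's exact procedure may differ.

```python
# Sketch: reweight source-domain validation examples so calibration statistics
# reflect the target distribution. Weights w(x) ~ p_target(x)/p_source(x) are
# estimated with a domain classifier (one common choice, assumed here).
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(x_source, x_target):
    x = np.vstack([x_source, x_target])
    d = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    clf = LogisticRegression(max_iter=1000).fit(x, d)
    p = clf.predict_proba(x_source)[:, 1]
    return p / (1.0 - p)

def weighted_binned_ece(conf, correct, weights, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges, right=True) - 1, 0, n_bins - 1)
    total, ece = weights.sum(), 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            w = weights[m]
            ece += (w.sum() / total) * abs(np.average(conf[m], weights=w)
                                           - np.average(correct[m], weights=w))
    return ece

# Hypothetical usage on synthetic source/target features and validation scores.
rng = np.random.default_rng(2)
x_s, x_t = rng.normal(0, 1, (1000, 2)), rng.normal(0.5, 1, (800, 2))
w = importance_weights(x_s, x_t)
conf = rng.uniform(0.5, 1.0, 1000)
correct = (rng.uniform(size=1000) < conf).astype(float)
print(weighted_binned_ece(conf, correct, w))
```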
arXiv Detail & Related papers (2020-06-29T21:50:07Z)