Related papers: Testing Calibration in Nearly-Linear Time

URL: http://arxiv.org/abs/2402.13187v2
Date: Fri, 21 Jun 2024 17:27:22 GMT
Title: Testing Calibration in Nearly-Linear Time
Authors: Lunjia Hu, Arun Jambulapati, Kevin Tian, Chutong Yang
Abstract summary: We focus on the algorithmic study of calibration through the lens of property testing. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph. We present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
Score: 14.099477870728595
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on $(predictions, binary outcomes)$, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
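For orientation, the empirical smooth calibration linear program mentioned in the abstract can be sketched as below. The notation is an assumption and does not appear on this page: $v_i \in [0,1]$ are the sampled predictions, $y_i \in \{0,1\}$ the outcomes, and $w_i$ the values of a witness function at the samples. It follows the usual definition of smooth calibration error as a supremum over 1-Lipschitz witness functions taking values in $[-1,1]$, and the paper's own formulation may differ in normalization or sign convention.

```latex
\[
\widehat{\mathrm{smCE}}
\;=\;
\max_{w \in [-1,1]^n} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - v_i)\, w_i
\quad \text{subject to} \quad
|w_i - w_j| \le |v_i - v_j| \;\; \text{for all } i, j \in \{1, \dots, n\}.
\]
```

Once the predictions are sorted, the constraints between adjacent samples imply all the others by the triangle inequality, so only $n-1$ coupling constraints along a path remain. This is plausibly the kind of structure behind the "highly-structured graph" in the minimum-cost flow reformulation described above.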

Related papers

CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning [62.56541355300587]
We introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths. Within this framework, we propose CarBoN, a two-phase method that first explores the solution space and then learns a calibration of the logits. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy.
arXiv Detail & Related papers (2025-10-17T14:04:37Z)
Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling [1.342834401139078]
We develop a well-calibrated predictor whose weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions.
arXiv Detail & Related papers (2025-02-21T01:24:27Z)
Orthogonal Causal Calibration [55.28164682911196]
We develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings.
arXiv Detail & Related papers (2024-06-04T03:35:25Z)
On the Distance from Calibration in Sequential Prediction [4.14360329494344]
We study a sequential binary prediction setting where the forecaster is evaluated in terms of the calibration distance. The calibration distance is a natural and intuitive measure of deviation from perfect calibration. We prove that there is a forecasting algorithm that achieves an $O(\sqrt{T})$ calibration distance in expectation on an adversarially chosen sequence of $T$ binary outcomes.
arXiv Detail & Related papers (2024-02-12T07:37:19Z)
On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration. Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration. We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
A Consistent and Differentiable Lp Canonical Calibration Error Estimator [21.67616079217758]
Deep neural networks are poorly calibrated and tend to output overconfident predictions. We propose a low-bias, trainable calibration error estimator based on Dirichlet kernel density estimates. Our method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities.
arXiv Detail & Related papers (2022-10-13T15:11:11Z)
Class-wise and reduced calibration methods [0.0]
First, we show how a reduced calibration method transforms the original problem into a simpler one. Second, we propose class-wise calibration methods that build on a phenomenon called neural collapse. Applying the two methods together results in class-wise reduced calibration algorithms, which are powerful tools for reducing the prediction and per-class calibration errors.
arXiv Detail & Related papers (2022-10-07T17:13:17Z)
Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression. This framework allows one to transform any regression model into a calibrated probabilistic model. We conduct an empirical study of MCC on 17 regression datasets.
arXiv Detail & Related papers (2022-06-23T03:25:23Z)
T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. We propose T-Cal, a minimax test for calibration based on a de-biased plug-in estimator of the $\ell$-Expected Calibration Error (ECE).
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
MBCT: Tree-Based Feature-Aware Binning for Individual Uncertainty Calibration [29.780204566046503]
We propose a feature-aware binning framework, called Multiple Boosting Calibration Trees (MBCT). Our MBCT is non-monotonic, and has the potential to improve order accuracy, due to its learnable binning scheme and the individual calibration. Results show that our method outperforms all competing models in terms of both calibration error and order accuracy.
arXiv Detail & Related papers (2022-02-09T08:59:16Z)
Top-label calibration [3.3504365823045044]
We study the problem of post-hoc calibration for multiclass classification, with an emphasis on histogram binning. We find that the popular notion of confidence calibration is not sufficiently strong -- there exist predictors that are not calibrated in any meaningful way but are perfectly confidence calibrated. We propose a closely related (but subtly different) notion, top-label calibration, that accurately captures the intuition and simplicity of confidence calibration, but addresses its drawbacks.
arXiv Detail & Related papers (2021-07-18T03:27:50Z)
Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z)
Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep-ensembles do not necessarily lead to improved calibration properties. We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models. This text examines the interplay between three of the most simple and commonly used approaches to leverage deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
