Mitigating Bias in Calibration Error Estimation
- URL: http://arxiv.org/abs/2012.08668v2
- Date: Wed, 24 Feb 2021 19:25:00 GMT
- Title: Mitigating Bias in Calibration Error Estimation
- Authors: Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, Michael C. Mozer
- Abstract summary: We introduce a simulation framework that allows us to empirically show that ECE_bin can systematically underestimate or overestimate the true calibration error.
We propose a simple alternative calibration error metric, ECE_sweep, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function.
- Score: 28.46667300490605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building reliable machine learning systems requires that we correctly
understand their level of confidence. Calibration measures the degree of
accuracy in a model's confidence and most research in calibration focuses on
techniques to improve an empirical estimate of calibration error, ECE_bin. We
introduce a simulation framework that allows us to empirically show that
ECE_bin can systematically underestimate or overestimate the true calibration
error depending on the nature of model miscalibration, the size of the
evaluation data set, and the number of bins. Critically, we find that ECE_bin
is more strongly biased for perfectly calibrated models. We propose a simple
alternative calibration error metric, ECE_sweep, in which the number of bins is
chosen to be as large as possible while preserving monotonicity in the
calibration function. Evaluating our measure on distributions fit to neural
network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that
ECE_sweep produces a less biased estimator of calibration error and therefore
should be used by any researcher wishing to evaluate the calibration of models
trained on similar datasets.
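For concreteness, the following is a minimal sketch of the two estimators discussed in the abstract: a standard equal-width binned estimator (ECE_bin) and a sweep variant (ECE_sweep) that grows the number of equal-mass bins until the per-bin accuracies stop being monotonically non-decreasing in confidence. The function names, the equal-mass splitting, and the exact stopping rule are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def ece_bin(confidences, correct, num_bins=15):
    """Standard binned ECE with equal-width bins over [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index in 0..num_bins-1; a confidence of exactly 1.0 falls in the last bin.
    idx = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def ece_sweep(confidences, correct):
    """Sweep variant: use the largest number of equal-mass bins for which
    per-bin accuracy stays monotonically non-decreasing in confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    conf_sorted, corr_sorted = confidences[order], correct[order]
    n = len(conf_sorted)
    best = 1
    for b in range(2, n + 1):
        accs = [chunk.mean() for chunk in np.array_split(corr_sorted, b)]
        if any(nxt < prev for prev, nxt in zip(accs, accs[1:])):
            break  # monotonicity violated; keep the previous bin count
        best = b
    ece = 0.0
    for conf_chunk, corr_chunk in zip(np.array_split(conf_sorted, best),
                                      np.array_split(corr_sorted, best)):
        ece += (len(conf_chunk) / n) * abs(corr_chunk.mean() - conf_chunk.mean())
    return ece

# Toy check in the spirit of the paper's simulation framework: draw confidences,
# then sample correctness with probability equal to the confidence, so the true
# calibration error is zero and any positive estimate reflects estimator bias.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
corr = rng.random(5000) < conf
print(ece_bin(conf, corr), ece_sweep(conf, corr))
```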
Related papers
- Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors [22.39558434131574]
We introduce the concept of consistency as an alternative perspective on model calibration.
We propose a post-hoc calibration method called Consistency Calibration (CC), which adjusts confidence based on the model's consistency across perturbed inputs.
We show that performing perturbations at the logit level significantly improves computational efficiency.
arXiv Detail & Related papers (2024-10-16T06:55:02Z)
- Optimizing Estimators of Squared Calibration Errors in Classification [2.3020018305241337]
We propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors.
Our approach advocates for a training-validation-testing pipeline when estimating a calibration error.
arXiv Detail & Related papers (2024-10-09T15:58:06Z)
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no-regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
- TCE: A Test-Based Approach to Measuring Calibration Error [7.06037484978289]
We propose a new metric to measure the calibration error of probabilistic binary classifiers, called test-based calibration error (TCE).
TCE incorporates a novel loss function based on a statistical test to examine the extent to which model predictions differ from probabilities estimated from data.
We demonstrate properties of TCE through a range of experiments, including multiple real-world imbalanced datasets and ImageNet 1000.
arXiv Detail & Related papers (2023-06-25T21:12:43Z)
- Calibration Error Estimation Using Fuzzy Binning [0.0]
We propose a Fuzzy Calibration Error metric (FCE) that utilizes a fuzzy binning approach to calculate calibration error.
Our results show that FCE offers better calibration error estimation, especially in multi-class settings.
arXiv Detail & Related papers (2023-04-30T18:06:14Z)
- Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks.
We analyze the problem statement, calibration definitions, and different approaches to evaluation.
Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
- Enabling Calibration In The Zero-Shot Inference of Large Vision-Language Models [58.720142291102135]
We measure calibration across relevant variables like prompt, dataset, and architecture, and find that zero-shot inference with CLIP is miscalibrated.
A single learned temperature generalizes for each specific CLIP model across inference dataset and prompt choice.
arXiv Detail & Related papers (2023-03-11T17:14:04Z)
- Sample-dependent Adaptive Temperature Scaling for Improved Calibration [95.7477042886242]
A common post-hoc approach to compensating for miscalibrated neural networks is temperature scaling (a sketch of standard single-temperature scaling appears after this list).
We propose to predict a different temperature value for each input, allowing us to adjust the mismatch between confidence and accuracy.
We test our method on the ResNet50 and WideResNet28-10 architectures using the CIFAR10/100 and Tiny-ImageNet datasets.
arXiv Detail & Related papers (2022-07-13T14:13:49Z)
- Revisiting Calibration for Question Answering [16.54743762235555]
We argue that the traditional evaluation of calibration does not reflect the usefulness of model confidence.
We propose a new calibration metric, MacroCE, that better captures whether the model assigns low confidence to wrong predictions and high confidence to correct predictions.
arXiv Detail & Related papers (2022-05-25T05:49:56Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric, the localized calibration error (LCE), that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves the LCE more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep-ensembles do not necessarily lead to improved calibration properties.
We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models.
The paper examines the interplay between three of the simplest and most commonly used approaches to leveraging deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
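Several entries above (the zero-shot CLIP calibration paper and the sample-dependent adaptive temperature scaling paper) build on standard temperature scaling as their post-hoc starting point. Below is a minimal sketch of that baseline, fitting a single scalar temperature by minimizing negative log-likelihood on held-out logits; the function names, the SciPy optimizer choice, and the search bounds are assumptions for illustration, not taken from either paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(logits, labels):
    """Fit a scalar T > 0 that minimizes the NLL of softmax(logits / T) on held-out data."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(temperature):
        log_probs = log_softmax(logits / temperature, axis=1)
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Bounded scalar search over a generous (but arbitrary) temperature range.
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def apply_temperature(logits, temperature):
    """Return calibrated class probabilities for new logits."""
    return np.exp(log_softmax(np.asarray(logits, dtype=float) / temperature, axis=1))
```

The sample-dependent method summarized above replaces the single scalar with a temperature predicted per input, and the CLIP study reports that a single learned temperature generalizes across datasets and prompts for a given model; in both cases the single-temperature fit is the reference point.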