What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
- URL: http://arxiv.org/abs/2205.11454v1
- Date: Mon, 23 May 2022 16:45:02 GMT
- Title: What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
- Authors: John Kirchenbauer, Jacob Oaks, Eric Heim
- Abstract summary: We argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed.
We use a generalization of Expected Calibration Error (ECE) that measures calibration error under different definitions of reliability.
We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across these ECE metrics.
- Score: 6.510061176722249
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Classifier calibration has received recent attention from the machine
learning community due both to its practical utility in facilitating decision
making, as well as the observation that modern neural network classifiers are
poorly calibrated. Much of this focus has been towards the goal of learning
classifiers such that their output with largest magnitude (the "predicted
class") is calibrated. However, this narrow interpretation of classifier
outputs does not adequately capture the variety of practical use cases in which
classifiers can aid in decision making. In this work, we argue that more
expressive metrics must be developed that accurately measure calibration error
for the specific context in which a classifier will be deployed. To this end,
we derive a number of different metrics using a generalization of Expected
Calibration Error (ECE) that measure calibration error under different
definitions of reliability. We then provide an extensive empirical evaluation
of commonly used neural network architectures and calibration techniques with
respect to these metrics. We find that: 1) definitions of ECE that focus solely
on the predicted class fail to accurately measure calibration error under a
selection of practically useful definitions of reliability and 2) many common
calibration techniques fail to improve calibration performance uniformly across
ECE metrics derived from these diverse definitions of reliability.
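To ground the discussion, the sketch below shows the standard top-label binned ECE together with one possible class-wise variant, illustrating how changing the underlying definition of reliability changes the metric being computed. This is a minimal sketch with assumed function names and an assumed bin count, not the authors' exact generalization.

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=15):
    """Standard binned ECE computed only over the predicted ("top") class.

    probs: (N, K) softmax outputs; labels: (N,) integer class labels.
    """
    conf = probs.max(axis=1)                    # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels)  # top-1 correctness
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap          # weight the gap by the bin's population
    return ece

def classwise_ece(probs, labels, n_bins=15):
    """One alternative definition of reliability: calibration of every class's
    probability, averaged across classes (a common class-wise ECE variant)."""
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for c in range(k):
        conf = probs[:, c]
        occurred = (labels == c).astype(float)  # did class c actually occur?
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (conf > lo) & (conf <= hi)
            if in_bin.any():
                total += in_bin.mean() * abs(conf[in_bin].mean() - occurred[in_bin].mean())
    return total / k
```

A model can score well on the first metric and poorly on the second; that kind of discrepancy is what the paper's context-specific metrics are meant to expose.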
Related papers
- Confidence Calibration of Classifiers with Many Classes [5.018156030818883]
For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score.
This score rarely reflects the true probability of a correct prediction well and typically requires a post-processing calibration step (a common choice, temperature scaling, is sketched after this list).
arXiv Detail & Related papers (2024-11-05T10:51:01Z)
- Towards Certification of Uncertainty Calibration under Adversarial Attacks [96.48317453951418]
We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations.
We propose novel calibration attacks and demonstrate how they can improve model calibration through adversarial calibration training.
arXiv Detail & Related papers (2024-05-22T18:52:09Z)
- From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration [0.3495246564946556]
Given that model-predicted scores are commonly seen as event probabilities, calibration is crucial for accurate interpretation.
We analyze the sensitivity of various calibration measures to score distortions and introduce a refined metric, the Local Score.
We apply these findings in a real-world scenario, using a Random Forest classifier and regressor to predict credit default while simultaneously measuring calibration.
arXiv Detail & Related papers (2024-02-12T16:55:19Z)
- Beyond Classification: Definition and Density-based Estimation of Calibration in Object Detection [15.71719154574049]
We tackle the challenge of defining and estimating calibration error for deep neural networks (DNNs).
In particular, we adapt the definition of classification calibration error to handle the nuances associated with object detection.
We propose a consistent and differentiable estimator of the detection calibration error, utilizing kernel density estimation.
arXiv Detail & Related papers (2023-12-11T18:57:05Z)
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no-regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
- Meta-Cal: Well-controlled Post-hoc Calibration by Ranking [23.253020991581963]
Post-hoc calibration is a technique to recalibrate a model, and its goal is to learn a calibration map.
Existing approaches mostly focus on constructing calibration maps with low calibration errors.
We study post-hoc calibration for multi-class classification under constraints, since a low calibration error does not necessarily mean a calibrator is useful in practice.
arXiv Detail & Related papers (2021-05-10T12:00:54Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that reduces this localized calibration error (LCE) more effectively than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z)
- Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test (a simplified sketch of this idea appears after this list).
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)
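For the post-processing calibration step referenced above (Confidence Calibration of Classifiers with Many Classes), a widely used post-hoc calibration map is temperature scaling. The sketch below is a minimal, assumed implementation that fits a single temperature on held-out logits by minimizing negative log-likelihood; the function names and the use of SciPy's bounded scalar optimizer are illustrative choices, not the method of any specific paper listed here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T > 0 on held-out logits by minimizing NLL.

    logits: (N, K) pre-softmax scores; labels: (N,) integer class labels.
    A minimal sketch of temperature scaling as a post-hoc calibration map.
    """
    def nll(t):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

def calibrate(logits, temperature):
    """Apply the fitted temperature and return calibrated probabilities."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
```

Because dividing the logits by a single positive temperature is a monotone rescaling, it changes confidences without changing the predicted class, which is why it primarily targets top-label calibration.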
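For the binning-free, Kolmogorov-Smirnov-inspired measure referenced above (Calibration of Neural Networks using Splines), the sketch below compares the cumulative confidence curve with the cumulative accuracy curve and reports their maximum gap. It is a simplified reading of the KS idea under assumed conventions, not that paper's exact spline-based estimator.

```python
import numpy as np

def ks_calibration_error(probs, labels):
    """Binning-free, KS-style calibration error for the predicted class.

    Sort samples by confidence, then compare the cumulative confidence curve
    with the cumulative accuracy curve; their maximum gap is the KS error.
    """
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    order = np.argsort(conf)
    n = len(conf)
    cum_conf = np.cumsum(conf[order]) / n
    cum_acc = np.cumsum(correct[order]) / n
    return np.abs(cum_conf - cum_acc).max()
```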