What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
- URL: http://arxiv.org/abs/2205.11454v1
- Date: Mon, 23 May 2022 16:45:02 GMT
- Title: What is Your Metric Telling You? Evaluating Classifier Calibration under Context-Specific Definitions of Reliability
- Authors: John Kirchenbauer, Jacob Oaks, Eric Heim
- Abstract summary: We argue that more expressive metrics must be developed that accurately measure calibration error for the specific context in which a classifier will be deployed.
We use a generalization of Expected Calibration Error (ECE) that measures calibration error under different definitions of reliability.
We find that: 1) definitions of ECE that focus solely on the predicted class fail to accurately measure calibration error under a selection of practically useful definitions of reliability, and 2) many common calibration techniques fail to improve calibration performance uniformly across these ECE metrics.
- Score: 6.510061176722249
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Classifier calibration has received recent attention from the machine
learning community due both to its practical utility in facilitating decision
making, as well as the observation that modern neural network classifiers are
poorly calibrated. Much of this focus has been towards the goal of learning
classifiers such that their output with largest magnitude (the "predicted
class") is calibrated. However, this narrow interpretation of classifier
outputs does not adequately capture the variety of practical use cases in which
classifiers can aid in decision making. In this work, we argue that more
expressive metrics must be developed that accurately measure calibration error
for the specific context in which a classifier will be deployed. To this end,
we derive a number of different metrics using a generalization of Expected
Calibration Error (ECE) that measure calibration error under different
definitions of reliability. We then provide an extensive empirical evaluation
of commonly used neural network architectures and calibration techniques with
respect to these metrics. We find that: 1) definitions of ECE that focus solely
on the predicted class fail to accurately measure calibration error under a
selection of practically useful definitions of reliability and 2) many common
calibration techniques fail to improve calibration performance uniformly across
ECE metrics derived from these diverse definitions of reliability.
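To ground the discussion, the sketch below shows the standard top-label binned ECE together with one possible class-wise variant, illustrating how changing the underlying definition of reliability changes the metric being computed. This is a minimal sketch with assumed function names and an assumed bin count, not the authors' exact generalization.

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=15):
    """Standard binned ECE computed only over the predicted ("top") class.

    probs: (N, K) softmax outputs; labels: (N,) integer class labels.
    """
    conf = probs.max(axis=1)                    # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels)  # top-1 correctness
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap          # weight the gap by the bin's population
    return ece

def classwise_ece(probs, labels, n_bins=15):
    """One alternative definition of reliability: calibration of every class's
    probability, averaged across classes (a common class-wise ECE variant)."""
    n, k = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for c in range(k):
        conf = probs[:, c]
        occurred = (labels == c).astype(float)  # did class c actually occur?
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (conf > lo) & (conf <= hi)
            if in_bin.any():
                total += in_bin.mean() * abs(conf[in_bin].mean() - occurred[in_bin].mean())
    return total / k
```

A model can score well on the first metric and poorly on the second; that kind of discrepancy is what the paper's context-specific metrics are meant to expose.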
Related papers
- Confidence Calibration of Classifiers with Many Classes [5.018156030818883]
For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score.
This score rarely reflects the true probability of a correct prediction well and typically requires a post-processing calibration step (a common choice, temperature scaling, is sketched after this list).
arXiv Detail & Related papers (2024-11-05T10:51:01Z)
- Towards Certification of Uncertainty Calibration under Adversarial Attacks [96.48317453951418]
We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations.
We propose novel calibration attacks and demonstrate how they can improve model calibration through adversarial calibration training.
arXiv Detail & Related papers (2024-05-22T18:52:09Z)
- From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration [0.3495246564946556]
Given that model-predicted scores are commonly seen as event probabilities, calibration is crucial for accurate interpretation.
We analyze the sensitivity of various calibration measures to score distortions and introduce a refined metric, the Local Score.
We apply these findings in a real-world scenario, using a Random Forest classifier and regressor to predict credit default while simultaneously measuring calibration.
arXiv Detail & Related papers (2024-02-12T16:55:19Z)
- Beyond Classification: Definition and Density-based Estimation of Calibration in Object Detection [15.71719154574049]
We tackle the challenge of defining and estimating calibration error for deep neural networks (DNNs).
In particular, we adapt the definition of classification calibration error to handle the nuances associated with object detection.
We propose a consistent and differentiable estimator of the detection calibration error, utilizing kernel density estimation.
arXiv Detail & Related papers (2023-12-11T18:57:05Z)
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no-regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
- Meta-Cal: Well-controlled Post-hoc Calibration by Ranking [23.253020991581963]
Post-hoc calibration is a technique to recalibrate a model, and its goal is to learn a calibration map.
Existing approaches mostly focus on constructing calibration maps with low calibration errors.
We study post-hoc calibration for multi-class classification under constraints, since a low calibration error does not necessarily mean a calibrator is useful in practice.
arXiv Detail & Related papers (2021-05-10T12:00:54Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that reduces this localized calibration error (LCE) more effectively than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z)
- Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test (a simplified sketch of this idea appears after this list).
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z)
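For the post-processing calibration step referenced above (Confidence Calibration of Classifiers with Many Classes), a widely used post-hoc calibration map is temperature scaling. The sketch below is a minimal, assumed implementation that fits a single temperature on held-out logits by minimizing negative log-likelihood; the function names and the use of SciPy's bounded scalar optimizer are illustrative choices, not the method of any specific paper listed here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T > 0 on held-out logits by minimizing NLL.

    logits: (N, K) pre-softmax scores; labels: (N,) integer class labels.
    A minimal sketch of temperature scaling as a post-hoc calibration map.
    """
    def nll(t):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

def calibrate(logits, temperature):
    """Apply the fitted temperature and return calibrated probabilities."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
```

Because dividing the logits by a single positive temperature is a monotone rescaling, it changes confidences without changing the predicted class, which is why it primarily targets top-label calibration.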
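For the binning-free, Kolmogorov-Smirnov-inspired measure referenced above (Calibration of Neural Networks using Splines), the sketch below compares the cumulative confidence curve with the cumulative accuracy curve and reports their maximum gap. It is a simplified reading of the KS idea under assumed conventions, not that paper's exact spline-based estimator.

```python
import numpy as np

def ks_calibration_error(probs, labels):
    """Binning-free, KS-style calibration error for the predicted class.

    Sort samples by confidence, then compare the cumulative confidence curve
    with the cumulative accuracy curve; their maximum gap is the KS error.
    """
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    order = np.argsort(conf)
    n = len(conf)
    cum_conf = np.cumsum(conf[order]) / n
    cum_acc = np.cumsum(correct[order]) / n
    return np.abs(cum_conf - cum_acc).max()
```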