An Entropic Metric for Measuring Calibration of Machine Learning Models
- URL: http://arxiv.org/abs/2502.14545v1
- Date: Thu, 20 Feb 2025 13:21:18 GMT
- Title: An Entropic Metric for Measuring Calibration of Machine Learning Models
- Authors: Daniel James Sumler, Lee Devlin, Simon Maskell, Richard O. Lane
- Abstract summary: We show how ECD may be applied to binary classification machine learning models. Our metric distinguishes under- and over-confidence. We demonstrate how this new metric performs on real and simulated data.
- Score: 2.467408627377504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the confidence with which a machine learning model classifies an input datum is an important, and perhaps under-investigated, concept. In this paper, we propose a new calibration metric, the Entropic Calibration Difference (ECD). Based on existing research in the field of state estimation, specifically target tracking (TT), we show how ECD may be applied to binary classification machine learning models. We describe the relative importance of under- and over-confidence and how they are not conflated in the TT literature. Indeed, our metric distinguishes under- from over-confidence. We consider this important given that algorithms that are under-confident are likely to be 'safer' than algorithms that are over-confident, albeit at the expense of also being over-cautious and so statistically inefficient. We demonstrate how this new metric performs on real and simulated data and compare with other metrics for machine learning model probability calibration, including the Expected Calibration Error (ECE) and its signed counterpart, the Expected Signed Calibration Error (ESCE).
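The abstract compares ECD against the Expected Calibration Error (ECE) and a signed counterpart. As a rough illustration of those two baselines only (not of ECD, whose definition is not given on this page), here is a minimal Python sketch using the common equal-width-binning form of ECE. The function name `calibration_errors`, the choice of 10 bins, and the sign convention (positive means over-confident) are assumptions made for this example; the paper's exact ESCE definition may differ.

```python
import numpy as np


def calibration_errors(probs, labels, n_bins=10):
    """Return (ece, signed_ece) for a binary classifier.

    probs  : predicted probability of class 1 for each sample
    labels : true labels (0 or 1)
    Sign convention (an assumption for this sketch): a positive signed
    value means confidence exceeds accuracy, i.e. over-confidence.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Predicted class and the confidence assigned to that class (in [0.5, 1]).
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Equal-width bins over [0, 1]; digitize against the interior edges.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])

    ece = 0.0
    signed = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = conf[mask].mean() - correct[mask].mean()  # confidence minus accuracy
        weight = mask.mean()                            # fraction of samples in the bin
        ece += weight * abs(gap)
        signed += weight * gap
    return ece, signed
```

With this sketch, a model that always outputs 0.9 but is right half the time and a model that always outputs 0.6 but is always right receive the same unsigned error, while the signed value separates the over-confident case from the under-confident one. That is the distinction the abstract emphasises.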
Related papers
- A comprehensive review of classifier probability calibration metrics [0.0]
Probabilities or confidence values produced by AI and ML models often do not reflect their true accuracy.
Probability calibration metrics measure the discrepancy between confidence and accuracy.
arXiv Detail & Related papers (2025-04-25T11:44:44Z) - Weak Supervision Performance Evaluation via Partial Identification [46.73061437177238]
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels.
We present a novel method to address this challenge by framing model evaluation as a partial identification problem.
Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques.
arXiv Detail & Related papers (2023-12-07T07:15:11Z) - On the Calibration of Uncertainty Estimation in LiDAR-based Semantic
Segmentation [7.100396757261104]
We propose a metric to measure the confidence calibration quality of a semantic segmentation model with respect to individual classes.
We additionally suggest a double use for the method to automatically find label problems to improve the quality of hand- or auto-annotated datasets.
arXiv Detail & Related papers (2023-08-04T10:59:24Z) - TCE: A Test-Based Approach to Measuring Calibration Error [7.06037484978289]
We propose a new metric to measure the calibration error of probabilistic binary classifiers, called test-based calibration error (TCE).
TCE incorporates a novel loss function based on a statistical test to examine the extent to which model predictions differ from probabilities estimated from data.
We demonstrate properties of TCE through a range of experiments, including multiple real-world imbalanced datasets and ImageNet 1000.
arXiv Detail & Related papers (2023-06-25T21:12:43Z) - Variable-Based Calibration for Machine Learning Classifiers [11.9995808096481]
We introduce the notion of variable-based calibration to characterize calibration properties of a model.
We find that models with near-perfect expected calibration error can exhibit significant miscalibration as a function of features of the data.
arXiv Detail & Related papers (2022-09-30T00:49:31Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Don't Just Blame Over-parametrization for Over-confidence: Theoretical
Analysis of Calibration in Binary Classification [58.03725169462616]
We show theoretically that over-parametrization is not the only reason for over-confidence.
We prove that logistic regression is inherently over-confident, in the realizable, under-parametrized setting.
Perhaps surprisingly, we also show that over-confidence is not always the case.
arXiv Detail & Related papers (2021-02-15T21:38:09Z) - Learning from Similarity-Confidence Data [94.94650350944377]
We investigate a novel weakly supervised learning problem of learning from similarity-confidence (Sconf) data.
We propose an unbiased estimator of the classification risk that can be calculated from only Sconf data and show that the estimation error bound achieves the optimal convergence rate.
arXiv Detail & Related papers (2021-02-13T07:31:16Z) - Calibrated neighborhood aware confidence measure for deep metric
learning [0.0]
Deep metric learning has been successfully applied to problems in few-shot learning, image retrieval, and open-set classifications.
Measuring the confidence of a deep metric learning model and identifying unreliable predictions is still an open challenge.
This paper focuses on defining a calibrated and interpretable confidence metric that closely reflects its classification accuracy.
arXiv Detail & Related papers (2020-06-08T21:05:38Z) - Machine learning for causal inference: on the use of cross-fit
estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.