On the Richness of Calibration
- URL: http://arxiv.org/abs/2302.04118v2
- Date: Sun, 14 May 2023 17:09:49 GMT
- Title: On the Richness of Calibration
- Authors: Benedikt Höltgen and Robert C. Williamson
- Abstract summary: We make explicit the choices involved in designing calibration scores.
We organise these into three grouping choices and a choice concerning the agglomeration of group errors.
In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions.
We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
- Score: 10.482805367361818
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Probabilistic predictions can be evaluated through comparisons with observed
label frequencies, that is, through the lens of calibration. Recent scholarship
on algorithmic fairness has started to look at a growing variety of
calibration-based objectives under the name of multi-calibration but has still
remained fairly restricted. In this paper, we explore and analyse forms of
evaluation through calibration by making explicit the choices involved in
designing calibration scores. We organise these into three grouping choices and
a choice concerning the agglomeration of group errors. This provides a
framework for comparing previously proposed calibration scores and helps to
formulate novel ones with desirable mathematical properties. In particular, we
explore the possibility of grouping datapoints based on their input features
rather than on predictions and formally demonstrate advantages of such
approaches. We also characterise the space of suitable agglomeration functions
for group errors, generalising previously proposed calibration scores.
Complementary to such population-level scores, we explore calibration scores at
the individual level and analyse their relationship to choices of grouping. We
draw on these insights to introduce and axiomatise fairness deviation measures
for population-level scores. We demonstrate that with appropriate choices of
grouping, these novel global fairness scores can provide notions of (sub-)group
or individual fairness.
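To make the framework concrete, here is a minimal sketch (our own illustration, not code from the paper; the grouping and agglomeration options shown are just two simple choices) of a calibration score assembled from an explicit grouping choice and an explicit agglomeration of group errors:

```python
import numpy as np

def calibration_score(y, p, groups, agglomerate="weighted_l1"):
    """Per-group calibration gap |mean(y) - mean(p)|, agglomerated over groups.

    y      : binary labels, shape (n,)
    p      : predicted probabilities, shape (n,)
    groups : group id per datapoint -- this encodes the *grouping choice*
    """
    ids = np.unique(groups)
    weights = np.array([np.mean(groups == g) for g in ids])
    gaps = np.array([abs(y[groups == g].mean() - p[groups == g].mean()) for g in ids])
    if agglomerate == "weighted_l1":   # ECE-style weighted average of group errors
        return float(np.sum(weights * gaps))
    if agglomerate == "max":           # worst-group agglomeration
        return float(np.max(gaps))
    raise ValueError(agglomerate)

rng = np.random.default_rng(0)
x = rng.uniform(size=10_000)                       # one input feature
y = (rng.uniform(size=10_000) < x).astype(float)   # so E[y | x] = x
p = np.full_like(x, y.mean())                      # constant, marginally calibrated predictor

pred_groups = np.digitize(p, np.linspace(0, 1, 11))  # group by prediction bins
feat_groups = np.digitize(x, np.linspace(0, 1, 11))  # group by the input feature
print(calibration_score(y, p, pred_groups))  # 0.0: calibrated w.r.t. its own predictions
print(calibration_score(y, p, feat_groups))  # ~0.25: miscalibrated w.r.t. features
```

The constant predictor is perfectly calibrated with respect to its own prediction bins yet badly miscalibrated with respect to the feature-based groups, which is the kind of distinction the paper's grouping choices make explicit.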
Related papers
- Calibration by Distribution Matching: Trainable Kernel Calibration Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no regret decisions.
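As a sketch of what a kernel calibration metric in this family can look like (an MMCE-style estimator; the paper's own metrics may differ), the following computes a kernel-weighted sum of products of calibration residuals, which admits a differentiable sample estimate:

```python
import numpy as np

def kernel_calibration_error(y, p, bandwidth=0.1):
    """MMCE-style kernel calibration score (biased V-statistic estimate).

    Averages k(p_i, p_j) * (y_i - p_i) * (y_j - p_j) over all pairs; the
    off-diagonal terms have zero mean when E[y | p] = p, and the expression
    is differentiable in p, so it can double as a training penalty.
    """
    r = y - p                                       # calibration residuals
    d = p[:, None] - p[None, :]
    k = np.exp(-(d ** 2) / (2 * bandwidth ** 2))    # Gaussian kernel on predictions
    return float(r @ k @ r) / len(y) ** 2

rng = np.random.default_rng(0)
p = rng.uniform(size=2000)
y_cal = (rng.uniform(size=2000) < p).astype(float)                       # calibrated
y_off = (rng.uniform(size=2000) < np.clip(p + 0.2, 0, 1)).astype(float)  # biased
print(kernel_calibration_error(y_cal, p))  # close to 0
print(kernel_calibration_error(y_off, p))  # clearly positive
```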
arXiv Detail & Related papers (2023-10-31T06:19:40Z)
- Towards Fair and Calibrated Models [26.74017047721052]
We work with a specific definition of fairness that closely matches [Biswas et al. 2019].
We show that an existing negative result towards achieving a fair and calibrated model does not hold for our definition of fairness.
We propose modifications of existing calibration losses to perform group-wise calibration, as a way of achieving fair and calibrated models.
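A minimal sketch of a group-wise calibration penalty in this spirit (our own illustration; the paper's concrete loss modifications may differ), computing a binned calibration error within each protected group and averaging with equal group weight:

```python
import numpy as np

def groupwise_calibration_penalty(y, p, group, n_bins=10):
    """Binned calibration error computed within each group, then averaged
    with equal weight per group (so small groups are not drowned out)."""
    bins = np.digitize(p, np.linspace(0, 1, n_bins + 1)[1:-1])
    penalties = []
    for g in np.unique(group):
        in_g = group == g
        err = 0.0
        for b in np.unique(bins[in_g]):
            cell = in_g & (bins == b)
            weight = cell.sum() / in_g.sum()        # fraction of the group in this bin
            err += weight * abs(y[cell].mean() - p[cell].mean())
        penalties.append(err)
    return float(np.mean(penalties))

# Hypothetical usage during training (with a smoothed, differentiable variant):
#   total_loss = log_loss(y, p) + lam * groupwise_calibration_penalty(y, p, group)
```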
arXiv Detail & Related papers (2023-10-16T13:41:09Z)
- Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis [2.8943928153775826]
The proposed Cluster-Focal method first identifies poorly calibrated samples, clusters them into groups, and then introduces a group-wise focal loss to mitigate calibration bias.
We evaluate our method on skin lesion classification with the public HAM10000 dataset, and on predicting future lesional activity for multiple sclerosis (MS) patients.
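A rough sketch of such a pipeline as we read it from the entry above (the proxy, the k-means clustering step, and the standard focal loss are our own illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_focal_loss(y, p, n_clusters=3, gamma=2.0):
    """Sketch: cluster samples by a per-sample miscalibration proxy, then
    average a focal loss within each cluster before averaging across
    clusters, so poorly calibrated clusters carry equal weight."""
    proxy = np.abs(y - p).reshape(-1, 1)            # per-sample miscalibration proxy
    clusters = KMeans(n_clusters, n_init=10, random_state=0).fit_predict(proxy)
    p_t = np.where(y == 1, p, 1 - p)                # probability of the true class
    focal = -((1 - p_t) ** gamma) * np.log(np.clip(p_t, 1e-12, None))
    return float(np.mean([focal[clusters == c].mean() for c in range(n_clusters)]))
```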
arXiv Detail & Related papers (2023-07-04T14:14:12Z)
- Matched Pair Calibration for Ranking Fairness [2.580183306478581]
We propose a test of fairness in score-based ranking systems called matched pair calibration.
We show how our approach generalizes the fairness intuitions of calibration from a binary classification setting to ranking.
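As an illustration of the matched-pair idea (our own sketch, assuming a simple nearest-score matching with a tolerance, not the paper's exact test):

```python
import numpy as np

def matched_pair_gap(scores_a, y_a, scores_b, y_b, tol=0.01):
    """Pair each group-A item with the closest-scoring group-B item and keep
    only near-exact matches; under cross-group calibration, mean outcomes
    within matched pairs should agree, so a large gap flags unfairness."""
    j = np.abs(scores_a[:, None] - scores_b[None, :]).argmin(axis=1)
    keep = np.abs(scores_a - scores_b[j]) <= tol
    return float(y_a[keep].mean() - y_b[j][keep].mean())
```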
arXiv Detail & Related papers (2023-06-06T15:32:30Z)
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness all have an impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
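One plausible reading of selective scaling, sketched below with ground-truth labels used purely for illustration (at deployment the correct/incorrect split would itself have to be predicted): apply a stronger temperature to misclassified, typically overconfident, samples:

```python
import numpy as np

def selective_scale(logits, labels, t_correct=1.0, t_wrong=2.0):
    """Apply a stronger softmax temperature to misclassified (typically
    overconfident) samples and leave correct predictions mostly untouched.
    NOTE: labels are used here only to illustrate the split; at deployment
    the correct/incorrect decision must itself be predicted."""
    preds = logits.argmax(axis=1)
    t = np.where(preds == labels, t_correct, t_wrong)[:, None]
    z = logits / t
    z -= z.max(axis=1, keepdims=True)               # numerically stable softmax
    probs = np.exp(z)
    return probs / probs.sum(axis=1, keepdims=True)
```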
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
- Fair admission risk prediction with proportional multicalibration [0.16249424686052708]
Multicalibration constrains calibration error among flexibly-defined subpopulations.
This makes it possible for a decision-maker to learn to trust or distrust model predictions for specific groups.
We propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins.
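A sketch of checking such a constraint (our own illustration, assuming "percent calibration error" means the per-cell calibration gap divided by the cell's observed positive rate):

```python
import numpy as np

def satisfies_pmc(y, p, group, alpha=0.1, n_bins=10, eps=1e-8):
    """Check that in every (group, prediction-bin) cell the calibration gap
    is at most an alpha fraction of the cell's observed positive rate."""
    bins = np.digitize(p, np.linspace(0, 1, n_bins + 1)[1:-1])
    for g in np.unique(group):
        for b in np.unique(bins):
            cell = (group == g) & (bins == b)
            if not cell.any():
                continue
            gap = abs(y[cell].mean() - p[cell].mean())
            if gap > alpha * max(y[cell].mean(), eps):  # *percent* calibration error
                return False
    return True
```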
arXiv Detail & Related papers (2022-09-29T08:15:29Z)
- Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach [131.20444904674494]
We tackle the problem of measuring group fairness under unawareness of sensitive attributes.
We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem.
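For illustration, a standard quantification estimator that fits this setting is Adjusted Classify & Count; the sketch below (with hypothetical usage) corrects a sensitive-attribute classifier's raw positive rate by its error rates:

```python
import numpy as np

def adjusted_classify_and_count(attr_preds, tpr, fpr):
    """Adjusted Classify & Count: corrects the raw positive-prediction rate
    of an attribute classifier by its true/false positive rates (measured on
    held-out data) to estimate the attribute's prevalence in a sample."""
    cc = np.mean(attr_preds)                        # naive classify-and-count rate
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

# Hypothetical usage: estimate the share of a protected group among accepted
# individuals without observing the attribute directly:
#   prevalence = adjusted_classify_and_count(g_hat[accepted], tpr=0.9, fpr=0.2)
```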
arXiv Detail & Related papers (2021-09-17T13:45:46Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
- Selective Classification Can Magnify Disparities Across Groups [89.14499988774985]
We find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities.
Increasing abstentions can even decrease accuracies on some groups.
We train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group.
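A minimal sketch of the kind of measurement behind this finding (our own illustration): per-group accuracy after abstaining on the least confident predictions at a given coverage:

```python
import numpy as np

def groupwise_selective_accuracy(conf, correct, group, coverage=0.8):
    """Per-group accuracy after abstaining on the least confident predictions;
    sweeping `coverage` exposes groups whose accuracy falls as others rise."""
    thresh = np.quantile(conf, 1 - coverage)        # keep the top `coverage` fraction
    kept = conf >= thresh
    return {g: float(correct[kept & (group == g)].mean())
            for g in np.unique(group)}
```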
arXiv Detail & Related papers (2020-10-27T08:51:30Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn, from labeled data, a scoring function that ranks positive individuals higher than negative ones.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across protected groups.
We propose a model post-processing framework for balancing ranking fairness and utility in the bipartite ranking scenario.
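As a sketch of one simple post-hoc adjustment of this kind (our own illustration, not necessarily the paper's algorithm): add a scalar offset to one group's scores, chosen on validation data to trade ranking utility (AUC) against a selection-rate gap:

```python
import numpy as np

def rank_auc(scores, y):
    """AUC via the rank-sum (Mann-Whitney) formula."""
    ranks = scores.argsort().argsort() + 1
    n_pos = int(y.sum()); n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_group_offset(scores, y, group, cutoff=0.5, lam=1.0):
    """Grid-search a scalar offset for group 1's scores that maximises
    AUC minus a penalty on the cross-group selection-rate gap at `cutoff`."""
    best, best_obj = 0.0, -np.inf
    for delta in np.linspace(-0.3, 0.3, 61):
        s = scores + delta * (group == 1)
        gap = abs((s[group == 0] >= cutoff).mean() - (s[group == 1] >= cutoff).mean())
        obj = rank_auc(s, y) - lam * gap            # utility minus unfairness penalty
        if obj > best_obj:
            best, best_obj = delta, obj
    return best
```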
arXiv Detail & Related papers (2020-06-15T10:08:39Z)