Adaptive Calibrator Ensemble for Model Calibration under Distribution
Shift
- URL: http://arxiv.org/abs/2303.05331v1
- Date: Thu, 9 Mar 2023 15:22:02 GMT
- Title: Adaptive Calibrator Ensemble for Model Calibration under Distribution
Shift
- Authors: Yuli Zou, Weijian Deng, Liang Zheng
- Abstract summary: Adaptive calibrator ensemble (ACE) calibrates OOD datasets whose difficulty is usually higher than that of the calibration set.
ACE generally improves the performance of a few state-of-the-art calibration schemes on a series of OOD benchmarks.
- Score: 23.794897699193875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model calibration usually requires optimizing some parameters (e.g.,
temperature) w.r.t. an objective function (e.g., negative log-likelihood). In
this paper, we report a plain, important but often neglected fact that the
objective function is influenced by calibration set difficulty, i.e., the ratio
of the number of incorrectly classified samples to that of correctly classified
samples. If a test set has a drastically different difficulty level from the
calibration set, the optimal calibration parameters of the two datasets would
be different. In other words, a calibrator optimal on the calibration set would
be suboptimal on the OOD test set and thus has degraded performance. With this
knowledge, we propose a simple and effective method named adaptive calibrator
ensemble (ACE) to calibrate OOD datasets whose difficulty is usually higher
than that of the calibration set. Specifically, two calibration functions are trained,
one for in-distribution data (low difficulty), and the other for severely OOD
data (high difficulty). To achieve desirable calibration on a new OOD dataset,
ACE uses an adaptive weighting method that strikes a balance between the two
extreme functions. When plugged in, ACE generally improves the performance of a
few state-of-the-art calibration schemes on a series of OOD benchmarks.
Importantly, such improvement does not come at the cost of the in-distribution
calibration accuracy.
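To make the idea concrete, below is a minimal PyTorch sketch of the two-calibrator scheme described above. It is an illustration under stated assumptions rather than the authors' implementation: the two calibrators are plain temperature scalers fitted by minimizing NLL, they are combined by blending their temperatures, and the adaptive weight comes from mean max-softmax confidence as a rough difficulty proxy. The helper names (fit_temperature, difficulty_weight, ace_calibrate) and the confidence-based weighting rule are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.05):
    """Fit a single temperature T > 0 by minimizing NLL on held-out logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

def difficulty_weight(test_logits):
    """Difficulty proxy (an assumption of this sketch): mean max-softmax confidence.

    Lower confidence suggests a harder, more shifted test set, which pushes the
    blend toward the OOD calibrator.
    """
    return F.softmax(test_logits, dim=1).max(dim=1).values.mean().item()

def ace_calibrate(test_logits, t_id, t_ood, w):
    """Blend the two calibrators: w = 1 trusts the in-distribution temperature,
    w = 0 trusts the OOD temperature."""
    t = w * t_id + (1.0 - w) * t_ood
    return F.softmax(test_logits / t, dim=1)

# Usage sketch:
# t_id  = fit_temperature(id_val_logits, id_val_labels)     # low-difficulty split
# t_ood = fit_temperature(ood_val_logits, ood_val_labels)   # high-difficulty split
# probs = ace_calibrate(test_logits, t_id, t_ood, w=difficulty_weight(test_logits))
```

Here t_id would be fitted on the in-distribution calibration split (low difficulty) and t_ood on a severely shifted split (high difficulty), mirroring the two extreme calibration functions the abstract describes.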
Related papers
- Calibrating Language Models with Adaptive Temperature Scaling [58.056023173579625]
We introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction.
ATS improves calibration by over 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods.
arXiv Detail & Related papers (2024-09-29T22:54:31Z)
- Optimizing Calibration by Gaining Aware of Prediction Correctness [30.619608580138802]
Cross-Entropy (CE) loss is widely used for calibrator training; it pushes the model to increase confidence in the ground-truth class.
We propose a new post-hoc calibration objective derived from the aim of calibration.
arXiv Detail & Related papers (2024-04-19T17:25:43Z)
- Deep Ensemble Shape Calibration: Multi-Field Post-hoc Calibration in Online Advertising [8.441925127670308]
In the e-commerce advertising scenario, estimating the true probabilities (known as a calibrated estimate) on Click-Through Rate (CTR) and Conversion Rate (CVR) is critical.
Previous research has introduced numerous solutions for addressing the calibration problem.
We introduce innovative basis calibration functions, which enhance both the expressive power of the calibration functions and data utilization.
arXiv Detail & Related papers (2024-01-17T11:41:11Z)
- Causal isotonic calibration for heterogeneous treatment effects [0.5249805590164901]
We propose causal isotonic calibration, a novel nonparametric method for calibrating predictors of heterogeneous treatment effects.
We also introduce cross-calibration, a data-efficient variant of calibration that eliminates the need for hold-out calibration sets.
arXiv Detail & Related papers (2023-02-27T18:07:49Z)
- A Unifying Theory of Distance from Calibration [9.959025631339982]
There is no consensus on how to quantify the distance from perfect calibration.
We propose a ground-truth notion of distance from calibration, inspired by the literature on property testing.
Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently.
arXiv Detail & Related papers (2022-11-30T10:38:24Z)
- T-Cal: An optimal test for the calibration of predictive models [49.11538724574202]
We consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem.
Detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions.
We propose T-Cal, a minimax test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE); a minimal binned ECE sketch follows the list below.
arXiv Detail & Related papers (2022-03-03T16:58:54Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that reduces the localized calibration error (LCE) more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep-ensembles do not necessarily lead to improved calibration properties.
We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models.
The paper examines the interplay between three of the simplest and most commonly used approaches to leveraging deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
- Unsupervised Calibration under Covariate Shift [92.02278658443166]
We introduce the problem of calibration under domain shift and propose an importance sampling based approach to address it.
We evaluate and discuss the efficacy of our method on both real-world datasets and synthetic datasets.
arXiv Detail & Related papers (2020-06-29T21:50:07Z)
- Calibration of Pre-trained Transformers [55.57083429195445]
We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning.
We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
arXiv Detail & Related papers (2020-03-17T18:58:44Z)
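Several of the entries above report results in terms of Expected Calibration Error (ECE). For reference, here is a minimal sketch of the standard equal-width binned ECE estimator in Python; this is the common plug-in estimator rather than the debiased estimator used by T-Cal, and the function and argument names are chosen for this sketch.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE: bin-weighted average of |accuracy - mean confidence|.

    confidences: (N,) max softmax probability per prediction
    correct:     (N,) 1.0 if the prediction was correct, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight = fraction of samples in the bin
    return ece
```

Using 15 bins and max-softmax confidences is a common convention in the calibration literature cited above.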
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.