Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics
Abstract Overview
This paper argues that Expected Calibration Error (ECE) is inadequate for assessing overconfidence risk, because ECE can remain small even when that risk is arbitrarily large. The authors introduce the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration, together with a z-score-based risk probability for overconfidence derived from a normal approximation. They also propose confidence-weighted accuracy (cwA) as a complementary measure of whether confidence scores meaningfully distinguish correct from incorrect predictions, and extend confidence weighting to standard classification metrics including AUC. The study combines theoretical analysis with experiments on synthetic confidence distributions (10 distributions × 8 calibration modes) and 15 real datasets, including raw, isotonic, and Platt-calibrated XGBoost outputs.
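For readers who want a concrete reference point for the metric being criticized, the following is a minimal sketch of the commonly used equal-width binned ECE. It is an assumption that the paper uses this variant; the function name, the default of 10 bins, and the toy data are illustrative only and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of (n_b / N) * |accuracy_b - mean_confidence_b|.

    Assumption: this is the common textbook variant, not necessarily the paper's.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each prediction to a bin by its confidence; 1.0 falls in the last bin.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): two populated bins with different calibration gaps.
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55, 0.55, 0.55]
toy_correct = [True, True, True, False, True, True, False, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

In the toy data the high-confidence bin has a gap of 0.20 and the low-confidence bin a gap of 0.05. ECE averages these by bin size, which illustrates how it summarizes absolute gaps without singling out the overconfident direction.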
Novelty
The paper's main novelty is separating confidence evaluation into two distinct components: overconfidence risk via CSR (with a closed-form normal-approximation risk probability requiring no Monte Carlo resampling) and discriminative usefulness via confidence-weighted metrics such as cwA and cwAUC. It also proves that classical AUC is invariant to monotone recalibration, whereas cwAUC is sensitive to calibration through pairwise confidence weights, so cwAUC − AUC captures the discriminative value added by calibration.
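The invariance of classical AUC can be checked numerically. The sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs and applies a strictly increasing recalibration map; the map `platt_like`, its parameters, and the toy scores are hypothetical. The sketch assumes a strictly increasing map so that no new ties are introduced. It does not implement cwAUC, whose pairwise weights are not specified in this summary.

```python
import math


def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def platt_like(s, a=3.0, b=-1.0):
    """Hypothetical strictly increasing recalibration map (sigmoid of an affine function)."""
    return 1.0 / (1.0 + math.exp(-(a * s + b)))


# Toy scores and binary labels (hypothetical).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0]

auc_raw = pairwise_auc(scores, labels)
auc_recalibrated = pairwise_auc([platt_like(s) for s in scores], labels)
```

Because the map preserves the ordering of scores, every pairwise comparison has the same outcome before and after recalibration, so the two AUC values coincide. Any metric meant to reward calibration therefore has to use the confidence values themselves, not only their ranks.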
Results
Across synthetic experiments (10 distributions, 8 calibration modes, 100 repetitions each), CSR stayed near 1 under perfect calibration with empirical false positive rates close to theoretical predictions, and P_risk reliably separated overconfident from calibrated or underconfident regimes. On 15 real datasets, isotonic calibration increased confidence-weighted performance on average (cwA = 0.8841) but drastically increased risk (10/15 datasets exceeding 3σ), while Platt scaling produced the safest profiles (0/15 exceeding 3σ, P_risk = 21.96%) with competitive confidence-weighted accuracy.
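To give a sense of scale for the 3σ threshold, the sketch below evaluates the one-sided tail of a standard normal distribution. It only illustrates the normal approximation in general; how the paper computes the z-score from CSR and maps it to P_risk is not stated in this summary, and the function name is hypothetical.

```python
import math


def upper_tail_probability(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# One-sided probability of exceeding 3 standard deviations under normality.
tail_3_sigma = upper_tail_probability(3.0)
```

Under a standard normal, a one-sided 3σ exceedance has probability of roughly 0.00135, so seeing it on 10 of 15 datasets would be very unlikely if those outputs were well calibrated.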
Key Points
- CSR is proposed as a calibration metric tied specifically to overconfidence risk. It equals 1 under perfect calibration, and a z-score based on a normal approximation yields a risk probability P_risk that requires no Monte Carlo resampling; the approximation improves as N grows.
- cwA measures the fraction of total confidence mass assigned to correct predictions, and the confidence-weighting approach is proven to extend to all standard classification metrics via structural relations of the confidence-weighted confusion matrix, including a calibration-sensitive cwAUC.
- Low ECE does not guarantee safety, as shown constructively in Proposition 5. On real data, isotonic calibration improves cwA but sharply increases overconfidence risk (10/15 datasets exceeding 3σ), while Platt scaling offers the safest profiles (0/15 exceeding 3σ).
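The description of cwA above, the fraction of total confidence mass assigned to correct predictions, can be sketched directly. The function name and toy data below are hypothetical, and the paper's exact formulation may differ in details such as how zero total confidence is handled.

```python
def confidence_weighted_accuracy(confidences, correct):
    """Share of total confidence mass that falls on correct predictions."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have equal length")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("total confidence mass must be positive")
    return sum(c for c, ok in zip(confidences, correct) if ok) / total


# Toy data (hypothetical): the same predictions scored by two confidence profiles.
correct = [True, True, True, False, False]
informative = [0.9, 0.9, 0.9, 0.3, 0.3]    # high confidence only where correct
uninformative = [0.7, 0.7, 0.7, 0.7, 0.7]  # identical confidence everywhere

accuracy = sum(correct) / len(correct)
cwa_informative = confidence_weighted_accuracy(informative, correct)
cwa_uninformative = confidence_weighted_accuracy(uninformative, correct)
```

With constant confidence, cwA reduces to plain accuracy (0.6 here). When confidence is concentrated on the correct predictions, cwA rises above accuracy (about 0.82 here), which is the sense in which it measures whether confidence distinguishes correct from incorrect predictions.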