Related papers: What Does It Take to Build a Performant Selective Classifier?

URL: http://arxiv.org/abs/2510.20242v2
Date: Fri, 24 Oct 2025 01:27:45 GMT
Title: What Does It Take to Build a Performant Selective Classifier?
Authors: Stephan Rabanser, Nicolas Papernot
Abstract summary: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack are studied. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks. Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training.
Score: 30.90225954725644
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Selective classifiers improve model reliability by abstaining on inputs the model deems uncertain. However, few practical approaches achieve the gold-standard performance of a perfect-ordering oracle that accepts examples exactly in order of correctness. Our work formalizes this shortfall as the selective-classification gap and presents the first finite-sample decomposition of this gap into five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, our analysis reveals that monotone post-hoc calibration -- often believed to strengthen selective classifiers -- has limited impact on closing this gap, since it rarely alters the model's underlying score ranking. Bridging the gap therefore requires scoring mechanisms that can effectively reorder predictions rather than merely rescale them. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks, isolating each error component through controlled experiments. Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training. Together, our decomposition yields a quantitative error budget as well as actionable design guidelines that practitioners can use to build selective classifiers that approximate ideal oracle behavior more closely.
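Illustration (not from the paper): the abstract's point about monotone calibration can be checked on toy data. The sketch below assumes synthetic confidence scores and correctness labels, and uses a hypothetical temperature-style rescaling as the monotone map. The helper names (selective_accuracy, oracle_accuracy, monotone_rescale) are invented for this example, and the per-coverage gap it prints is a simplified stand-in for the paper's formal selective-classification gap.

```python
import math
import random


def selective_accuracy(scores, correct, coverage):
    """Accuracy on the `coverage` fraction of examples with the highest scores."""
    k = max(1, round(coverage * len(scores)))
    # Rank examples by confidence, highest first, and accept the top k.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(correct[i] for i in order[:k]) / k


def oracle_accuracy(correct, coverage):
    """Accuracy of a perfect-ordering oracle that accepts correct examples first."""
    k = max(1, round(coverage * len(correct)))
    return min(sum(correct), k) / k


def monotone_rescale(score, temperature=2.0):
    """Hypothetical strictly increasing map on (0, 1): a temperature-scaled logit."""
    logit = math.log(score / (1.0 - score))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


# Toy data (an assumption): correctness is only loosely tied to the score.
random.seed(0)
scores = [random.uniform(0.05, 0.95) for _ in range(1000)]
correct = [1 if random.random() < 0.5 + 0.4 * s else 0 for s in scores]
rescaled = [monotone_rescale(s) for s in scores]

for coverage in (0.25, 0.5, 0.75):
    raw = selective_accuracy(scores, correct, coverage)
    cal = selective_accuracy(rescaled, correct, coverage)
    gap = oracle_accuracy(correct, coverage) - raw
    print(f"coverage={coverage:.2f} raw={raw:.3f} rescaled={cal:.3f} gap_to_oracle={gap:.3f}")
```

Because the rescaling preserves the score ordering, the same examples are accepted at every coverage level. The raw and rescaled accuracies therefore coincide, and the gap to the oracle is unchanged.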

Related papers

CAOS: Conformal Aggregation of One-Shot Predictors [0.0]
One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example. Standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors.
arXiv Detail & Related papers (2026-01-08T18:44:21Z)
Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation [61.248535801314375]
Subset-Selected Counterfactual Augmentation (SS-CA). We develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions. Experiments show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks.
arXiv Detail & Related papers (2025-11-15T08:39:22Z)
A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond [1.5229257192293202]
We propose a novel framework for uncertainty quantification in machine learning, which is based on proper scores. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models. We generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors.
arXiv Detail & Related papers (2025-08-25T13:11:03Z)
Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments [5.5855749614100825]
This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.
arXiv Detail & Related papers (2025-05-25T23:17:47Z)
Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We present a novel variational formulation of the calibration-refinement decomposition. We provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training.
arXiv Detail & Related papers (2025-01-31T15:03:54Z)
Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification [1.8434042562191815]
Selective classification is a powerful tool for automated decision-making in high-risk scenarios. Our goal is to minimize the number of indecisions, which are observations that we do not automate. By using indecisions, we are able to control the misclassification rate to any user-specified level, even below the Bayes optimal error rate.
arXiv Detail & Related papers (2024-12-17T11:25:51Z)
Improving Predictor Reliability with Selective Recalibration [15.319277333431318]
Recalibration is one of the most effective ways to produce reliable confidence estimates with a pre-trained model. We propose selective recalibration, where a selection model learns to reject some user-chosen proportion of the data. Our results show that selective recalibration consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines.
arXiv Detail & Related papers (2024-10-07T18:17:31Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
arXiv Detail & Related papers (2023-07-06T04:13:57Z)
Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency. We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z)
Improving Adaptive Conformal Prediction Using Self-Supervised Learning [72.2614468437919]
We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores. We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.
arXiv Detail & Related papers (2023-02-23T18:57:14Z)
Consistency Regularization for Certified Robustness of Smoothed Classifiers [89.72878906950208]
A recent technique of randomized smoothing has shown that the worst-case $\ell$-robustness can be transformed into the average-case robustness. We found that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled by simply regularizing the prediction consistency over noise.
arXiv Detail & Related papers (2020-06-07T06:57:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.