In Defense of Softmax Parametrization for Calibrated and Consistent
Learning to Defer
- URL: http://arxiv.org/abs/2311.01106v1
- Date: Thu, 2 Nov 2023 09:15:52 GMT
- Title: In Defense of Softmax Parametrization for Calibrated and Consistent
Learning to Defer
- Authors: Yuzhou Cao, Hussein Mozannar, Lei Feng, Hongxin Wei, Bo An
- Abstract summary: It has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring.
We show that the miscalibrated and unbounded estimator in prior literature is caused by the symmetric nature of the surrogate losses used, not by softmax.
We propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness.
- Score: 27.025808709031864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling machine learning classifiers to defer their decision to a downstream
expert when the expert is more accurate will ensure improved safety and
performance. This objective can be achieved with the learning-to-defer
framework, which aims to jointly learn how to classify and how to defer to the
expert. In recent studies, it has been theoretically shown that popular
estimators for learning to defer parameterized with softmax provide unbounded
estimates for the likelihood of deferring, which makes them uncalibrated.
However, it remains unknown whether this is due to the widely used softmax
parameterization and if we can find a softmax-based estimator that is both
statistically consistent and possesses a valid probability estimator. In this
work, we first show that the miscalibrated and unbounded estimator in prior
literature is caused by the symmetric nature of the surrogate losses used and
not by softmax. We then propose a novel statistically consistent
asymmetric softmax-based surrogate loss that can produce valid estimates
without the issue of unboundedness. We further analyze the non-asymptotic
properties of our method and empirically validate its performance and
calibration on benchmark datasets.
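The unboundedness described in the abstract can be illustrated with a minimal sketch. It assumes the estimator form reported in prior work on softmax-parametrized learning to defer: a (K+1)-way softmax whose last output is the deferral option, trained with a symmetric cross-entropy surrogate, with the expert-accuracy estimate recovered as p_defer / (1 - p_defer). That form is not stated on this page, and the function names below are hypothetical. It is not the asymmetric loss proposed in the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def expert_accuracy_estimate(logits):
    """Estimate P(expert correct | x) from K+1 logits (last one = defer).

    Assumed form from prior softmax-based L2D work: p_defer / (1 - p_defer).
    The ratio lies in [0, inf), so it is not a valid probability in general.
    """
    p_defer = softmax(logits)[-1]
    return p_defer / (1.0 - p_defer)

# At the assumed population optimum the softmax outputs are
# eta_y / (1 + q) for the classes and q / (1 + q) for deferral,
# where q = P(expert correct | x). The ratio then recovers q.
eta, q = np.array([0.7, 0.3]), 0.8
optimal_logits = np.log(np.append(eta, q) / (1.0 + q))
recovered = expert_accuracy_estimate(optimal_logits)

# Away from that optimum nothing keeps the ratio below 1:
# a large deferral logit yields an "accuracy" estimate above 1.
overshoot = expert_accuracy_estimate([0.0, 0.0, 3.0])

print(recovered, overshoot)
```

The ratio is well behaved only when the deferral output stays at or below 1/2, which a trained network is not guaranteed to satisfy; this is the unbounded behaviour the abstract attributes to symmetric surrogate losses.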
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts [78.3687645289918]
We show that the sigmoid gating function enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation.
We find that experts formulated as feed-forward networks with commonly used activations such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating.
arXiv Detail & Related papers (2024-05-22T21:12:34Z)
- Uncertainty Estimation for Safety-critical Scene Segmentation via Fine-grained Reward Maximization [12.79542334840646]
Uncertainty estimation plays an important role for future reliable deployment of deep segmentation models in safety-critical scenarios.
We propose a novel fine-grained reward maximization (FGRM) framework to address uncertainty estimation.
Our method outperforms state-of-the-art methods by a clear margin on all the calibration metrics of uncertainty estimation.
arXiv Detail & Related papers (2023-11-05T17:43:37Z)
- Calibrating Neural Simulation-Based Inference with Differentiable Coverage Probability [50.44439018155837]
We propose to include a calibration term directly into the training objective of the neural model.
By introducing a relaxation of the classical formulation of calibration error we enable end-to-end backpropagation.
It is directly applicable to existing computational pipelines allowing reliable black-box posterior inference.
arXiv Detail & Related papers (2023-10-20T10:20:45Z)
- Learning to Defer to Multiple Experts: Consistent Surrogate Losses, Confidence Calibration, and Conformal Ensembles [0.966840768820136]
We study the statistical properties of learning to defer (L2D) to multiple experts.
We address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts.
arXiv Detail & Related papers (2022-10-30T21:27:29Z)
- Revisiting Softmax for Uncertainty Approximation in Text Classification [45.07154956156555]
Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability.
One of the most widely used uncertainty approximation methods is Monte Carlo (MC) Dropout, which is computationally expensive.
We compare softmax and an efficient version of MC Dropout on their uncertainty approximations and downstream text classification performance.
We find that, while MC dropout produces the best uncertainty approximations, using a simple softmax leads to competitive and in some cases better uncertainty estimation for text classification at a much lower computational cost.
arXiv Detail & Related papers (2022-10-25T14:13:53Z)
- Theoretical characterization of uncertainty in high-dimensional linear classification [24.073221004661427]
We show that uncertainty for learning from a limited number of samples of high-dimensional input data and labels can be obtained by the approximate message passing algorithm.
We discuss how over-confidence can be mitigated by appropriately regularising, and show that cross-validating with respect to the loss leads to better calibration than with the 0/1 error.
arXiv Detail & Related papers (2022-02-07T15:32:07Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for bias-constrained estimators (BCE) is in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization [69.07420650261649]
We introduce a novel, simple, and powerful contrastive MI estimator named FLO.
Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently.
The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
arXiv Detail & Related papers (2021-07-02T15:20:41Z)
- Improving Deterministic Uncertainty Estimation in Deep Learning for Classification and Regression [30.112634874443494]
We propose a new model that estimates uncertainty in a single forward pass.
Our approach combines a bi-Lipschitz feature extractor with an inducing point approximate Gaussian process, offering robust and principled uncertainty estimation.
arXiv Detail & Related papers (2021-02-22T23:29:12Z)
- Orthogonal Statistical Learning [49.55515683387805]
We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk depends on an unknown nuisance parameter.
We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order.
arXiv Detail & Related papers (2019-01-25T02:21:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.