Related papers: (Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy

(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy

URL: http://arxiv.org/abs/2306.00312v1
Date: Thu, 1 Jun 2023 03:22:15 GMT
Title: (Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
Authors: Elan Rosenfeld, Saurabh Garg
Abstract summary: We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. In particular, our bound requires a simple, intuitive condition which is well justified by prior empirical works. We expect this loss can serve as a drop-in replacement for future methods which require maximizing multiclass disagreement.
Score: 8.010528849585937
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give estimates that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. In particular, the latter only give guarantees based on complex continuous measures such as test calibration -- which cannot be identified without labels -- and are therefore unreliable. Instead, our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. We expect this loss can serve as a drop-in replacement for future methods which require maximizing multiclass disagreement. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines. Code is publicly available at https://github.com/erosenfeld/disagree_discrep .

Related papers

Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment [16.352863226512984]
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference.<n>Most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment.<n>We propose ADAPT, an Advanced Distribution-Aware and back propagation-free Test-time adaptation method.
arXiv Detail & Related papers (2025-08-21T13:42:49Z)
ODD: Overlap-aware Estimation of Model Performance under Distribution Shift [8.569585481097839]
Prior work uses disagreement discrepancy (DIS2) to derive practical error bounds under distribution shifts.<n>We devise Overlap-aware Disagreement Discrepancy (ODD)<n>Our ODD-based bound uses domain-classifiers to estimate domain-overlap and better predicts target performance than DIS2.
arXiv Detail & Related papers (2025-06-17T21:05:42Z)
Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We show that calibration error and refinement error are not minimized simultaneously during training. We introduce a new metric for early stopping and hyper parameter tuning that makes it possible to minimize refinement error during training. Our method integrates seamlessly with any architecture and consistently improves performance across diverse classification tasks.
arXiv Detail & Related papers (2025-01-31T15:03:54Z)
Inference Scaling $\scriptsize\mathtt{F}$Laws: The Limits of LLM Resampling with Imperfect Verifiers [13.823743787003787]
Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models. We show that no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. We also show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
arXiv Detail & Related papers (2024-11-26T15:13:06Z)
Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning. One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems. We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
Dirichlet-Based Prediction Calibration for Learning with Noisy Labels [40.78497779769083]
Learning with noisy labels can significantly hinder the generalization performance of deep neural networks (DNNs) Existing approaches address this issue through loss correction or example selection methods. We propose the textitDirichlet-based Prediction (DPC) method as a solution.
arXiv Detail & Related papers (2024-01-13T12:33:04Z)
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning [59.44422468242455]
We propose a novel method dubbed ShrinkMatch to learn uncertain samples. For each uncertain sample, it adaptively seeks a shrunk class space, which merely contains the original top-1 class. We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations.
arXiv Detail & Related papers (2023-08-13T14:05:24Z)
Distribution-Free Inference for the Regression Function of Binary Classification [0.0]
The paper presents a resampling framework to construct exact, distribution-free and non-asymptotically guaranteed confidence regions for the true regression function for any user-chosen confidence level. It is proved that the constructed confidence regions are strongly consistent, that is, any false model is excluded in the long run with probability one.
arXiv Detail & Related papers (2023-08-03T15:52:27Z)
Is the Performance of My Deep Network Too Good to Be True? A Direct Approach to Estimating the Bayes Error in Binary Classification [86.32752788233913]
In classification problems, the Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance. We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show emphuncertainty of the classes. Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data.
arXiv Detail & Related papers (2022-02-01T13:22:26Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Shift Happens: Adjusting Classifiers [2.8682942808330703]
Minimizing expected loss measured by a proper scoring rule, such as Brier score or log-loss (cross-entropy), is a common objective while training a probabilistic classifier. We propose methods that transform all predictions to (re)equalize the average prediction and the class distribution. We demonstrate experimentally that, when in practice the class distribution is known only approximately, there is often still a reduction in loss depending on the amount of shift and the precision to which the class distribution is known.
arXiv Detail & Related papers (2021-11-03T21:27:27Z)
Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density [125.64297244986552]
We propose an unsupervised validation criterion that measures the density of soft neighborhoods by computing the entropy of the similarity distribution between points. Our criterion is simpler than competing validation methods, yet more effective.
arXiv Detail & Related papers (2021-08-24T17:41:45Z)
Coping with Label Shift via Distributionally Robust Optimisation [72.80971421083937]
We propose a model that minimises an objective based on distributionally robust optimisation (DRO) We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective.
arXiv Detail & Related papers (2020-10-23T08:33:04Z)
Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for crossvalidation and consistent estimators of its variance under weak stability conditions on the learning algorithm. Results are the first of their kind for the popular choice of leave-one-out cross-validation.
arXiv Detail & Related papers (2020-07-24T17:40:06Z)
Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction [0.8594140167290097]
We develop conformal prediction methods for constructing valid confidence sets in multiclass and multilabel problems. By leveraging ideas from quantile regression, we build methods that always guarantee correct coverage but additionally provide conditional coverage for both multiclass and multilabel prediction problems.
arXiv Detail & Related papers (2020-04-21T17:45:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.