Conformal prediction for the design problem
- URL: http://arxiv.org/abs/2202.03613v2
- Date: Thu, 10 Feb 2022 07:43:58 GMT
- Title: Conformal prediction for the design problem
- Authors: Clara Fannjiang, Stephen Bates, Anastasios Angelopoulos, Jennifer Listgarten, Michael I. Jordan
- Abstract summary: In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
- Score: 72.14982816083297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world deployments of machine learning, we use a prediction
algorithm to choose what data to test next. For example, in the protein design
problem, we have a regression model that predicts some real-valued property of
a protein sequence, which we use to propose new sequences believed to exhibit
higher property values than observed in the training data. Since validating
designed sequences in the wet lab is typically costly, it is important to know
how much we can trust the model's predictions. In such settings, however, there
is a distinct type of distribution shift between the training and test data:
one where the training and test data are statistically dependent, as the latter
is chosen based on the former. Consequently, the model's error on the test data
-- that is, the designed sequences -- has some non-trivial relationship with
its error on the training data. Herein, we introduce a method to quantify
predictive uncertainty in such settings. We do so by constructing confidence
sets for predictions that account for the dependence between the training and
test data. The confidence sets we construct have finite-sample guarantees that
hold for any prediction algorithm, even when a trained model chooses the
test-time input distribution. As a motivating use case, we demonstrate how our
method quantifies uncertainty for the predicted fitness of designed proteins
using real data sets.
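The core construction is a conformal confidence set whose calibration accounts for the dependence between training and designed test data. As a minimal, hedged sketch (not the paper's exact algorithm), the following shows weighted split conformal regression under covariate shift; the likelihood ratio `lr` between the design and training input distributions is an assumed input that would have to be known or estimated:

```python
import numpy as np

def weighted_split_conformal_interval(mu, X_cal, y_cal, x_test, lr, alpha=0.1):
    """Weighted split conformal interval under covariate shift (sketch).

    mu:     fitted regression model, assumed vectorized over rows of its input
    X_cal:  calibration inputs drawn from the training distribution
    y_cal:  calibration labels
    x_test: a designed input whose label we want to cover
    lr:     likelihood ratio x -> p_design(x) / p_train(x); a hypothetical
            input here, assumed known or estimated
    """
    scores = np.abs(y_cal - mu(X_cal))              # absolute residuals
    w = np.array([lr(x) for x in X_cal])            # calibration weights
    w_test = lr(x_test)
    p = np.append(w, w_test) / (w.sum() + w_test)   # normalized weights
    # Weighted (1 - alpha)-quantile of scores, treating the test point's
    # weight as sitting at +infinity (the standard conformal construction).
    order = np.argsort(scores)
    cum = np.cumsum(p[:-1][order])
    if cum[-1] < 1 - alpha:
        q = np.inf                                  # not enough calibration mass
    else:
        q = scores[order][np.searchsorted(cum, 1 - alpha)]
    pred = mu(np.asarray([x_test]))[0]
    return pred - q, pred + q
```

The paper's full method addresses the stronger setting of feedback covariate shift, where the test-time input distribution itself depends on the training data through the trained model; the sketch above only illustrates the simpler fixed-shift case.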
Related papers
- Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning [53.42244686183879]
Conformal prediction provides model-agnostic and distribution-free uncertainty quantification.
Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data.
We propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning.
arXiv Detail & Related papers (2024-10-13T15:37:11Z)
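For context on the object RPS hardens against poisoning, here is a minimal standard split conformal classification set construction (a sketch of plain conformal prediction, not the RPS method itself, whose robust construction is not detailed in the summary above):

```python
import numpy as np

def split_conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Standard split conformal classification sets (sketch; not RPS itself).

    probs_cal:  (n, K) predicted class probabilities on a clean calibration set
    y_cal:      (n,) integer calibration labels
    probs_test: (m, K) predicted probabilities on test inputs
    """
    n = len(y_cal)
    # Nonconformity score: 1 minus the probability of the true class.
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # Conformal quantile with the finite-sample correction.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    qhat = np.inf if k > n else np.sort(scores)[k - 1]
    # Each prediction set contains every class scoring below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in probs_test]
```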
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting the reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse, problem.
We then introduce a new method for estimating the influence of training data, which requires calculating gradients only for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
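One plausible operationalization of the "gradient on test samples, forward pass per training point" recipe described above (a hedged sketch, not necessarily the paper's exact estimator): take a small parameter step along the test gradient, then score each training point by the finite-difference change in its own loss.

```python
import torch

def mirrored_influence_scores(model, loss_fn, test_batch, train_points, eps=1e-3):
    """Hedged sketch: one gradient computation on the test batch, then only
    forward passes over the training points."""
    x_test, y_test = test_batch
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the test loss (computed once).
    g = torch.autograd.grad(loss_fn(model(x_test), y_test), params)

    def train_losses():
        with torch.no_grad():
            return torch.stack([loss_fn(model(x[None]), y[None]).squeeze()
                                for x, y in train_points])

    base = train_losses()                  # one forward pass per training point
    with torch.no_grad():
        for p, gp in zip(params, g):       # perturb toward lower test loss
            p -= eps * gp
    shifted = train_losses()               # second forward pass per point
    with torch.no_grad():
        for p, gp in zip(params, g):       # restore the original weights
            p += eps * gp
    # Directional derivative of each training loss along the test gradient;
    # larger values indicate more aligned influence.
    return (base - shifted) / eps
```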
- Quantification of Predictive Uncertainty via Inference-Time Sampling [57.749601811982096]
We propose a post-hoc sampling strategy for estimating predictive uncertainty that accounts for data ambiguity.
The method can generate different plausible outputs for a given input and does not assume parametric forms for the predictive distribution.
arXiv Detail & Related papers (2023-08-03T12:43:21Z)
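The summary leaves the sampler unspecified; as an illustrative stand-in, a generic post-hoc scheme that perturbs the input and collects multiple outputs, yielding a nonparametric predictive distribution (the Gaussian `noise_std` perturbation model is an assumption for this sketch, not the paper's method):

```python
import numpy as np

def inference_time_samples(predict, x, n_samples=100, noise_std=0.05, rng=None):
    """Generic post-hoc sampling sketch: draw multiple plausible outputs
    for one input without assuming a parametric predictive distribution."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    # Draw perturbed copies of the input and collect the model's outputs.
    outputs = np.array([predict(x + rng.normal(0.0, noise_std, size=x.shape))
                        for _ in range(n_samples)])
    # Summarize the empirical predictive distribution.
    return {"mean": outputs.mean(axis=0),
            "std": outputs.std(axis=0),
            "samples": outputs}
```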
- Robust Flow-based Conformal Inference (FCI) with Statistical Guarantee [4.821312633849745]
We develop a series of conformal inference methods, including building predictive sets and inferring outliers for complex and high-dimensional data.
We evaluate our method, robust flow-based conformal inference, on benchmark datasets.
arXiv Detail & Related papers (2022-05-22T04:17:30Z)
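One building block consistent with the description is a conformal p-value for outlier detection, where the nonconformity score could be, e.g., a negative log-density from a fitted flow (a sketch of the general conformal mechanism, not FCI's specific pipeline):

```python
import numpy as np

def conformal_outlier_pvalue(score_cal, score_test):
    """Conformal p-value for outlier detection (sketch).

    score_cal:  nonconformity scores on a clean calibration set, e.g. the
                negative log-density under a fitted flow model
    score_test: score of a new point; larger means more atypical

    Under exchangeability with the calibration data, the returned p-value
    satisfies P(p <= t) <= t, so thresholding at t controls false alarms.
    """
    n = len(score_cal)
    return (1 + np.sum(score_cal >= score_test)) / (n + 1)
```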
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
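The ATC recipe is concrete enough to sketch directly (using raw confidence as the score; the paper also considers alternatives such as negative entropy):

```python
import numpy as np

def atc_predict_accuracy(conf_source, correct_source, conf_target):
    """Average Thresholded Confidence (sketch of the idea described above).

    conf_source:    model confidences on labeled source validation data
    correct_source: booleans, whether each source prediction was correct
    conf_target:    confidences on unlabeled target data
    """
    source_acc = np.mean(correct_source)
    # Choose t so that the fraction of source points with confidence above t
    # matches the source accuracy: the (1 - acc)-quantile of confidences.
    t = np.quantile(conf_source, 1.0 - source_acc)
    # Predicted target accuracy: fraction of target points above the threshold.
    return np.mean(conf_target > t)
```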
- Improving Uncertainty Calibration via Prior Augmented Data [56.88185136509654]
Neural networks have proven successful at learning from complex data distributions by acting as universal function approximators.
However, they are often overconfident in their predictions, which leads to inaccurate and miscalibrated probabilistic predictions.
We propose a solution that seeks out regions of feature space where the model is unjustifiably overconfident and conditionally raises the entropy of those predictions towards that of the prior distribution of the labels.
arXiv Detail & Related papers (2021-02-22T07:02:37Z)
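A minimal sketch of the kind of loss term this suggests, assuming PyTorch and a batch of augmented inputs standing in for the overconfident regions (the region-finding procedure is the paper's contribution and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def prior_entropy_penalty(logits_aug, label_prior):
    """Sketch of the calibration idea above (not the paper's exact loss):
    on augmented/off-manifold inputs where confidence is unjustified, pull
    the predictive distribution toward the label prior, raising its entropy
    toward the prior's.

    logits_aug:  (n, K) logits on augmented inputs
    label_prior: (K,) prior class distribution, e.g. empirical label freqs
    """
    log_probs = F.log_softmax(logits_aug, dim=-1)
    # KL(prior || model) per example, averaged; zero iff model matches prior.
    return F.kl_div(log_probs, label_prior.expand_as(log_probs),
                    reduction="batchmean")
```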
- Robust Validation: Confident Predictions Even When Distributions Shift [19.327409270934474]
We describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions alone.
We present a method that produces prediction sets giving (almost exactly) the right coverage level for any test distribution in an $f$-divergence ball around the training population.
An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it.
arXiv Detail & Related papers (2020-08-10T17:09:16Z)
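A worked special case that conveys the idea: for a total variation ball (one instance of an $f$-divergence ball), a closed-form quantile inflation suffices, since a TV shift of radius rho can move at most rho probability mass (the paper treats general $f$-divergences and also estimates the shift size):

```python
import numpy as np

def tv_robust_conformal_quantile(scores, alpha=0.1, rho=0.05):
    """Robust conformal score threshold for a TV ball of radius rho (sketch).

    If TV(P_train, P_test) <= rho, a threshold with training coverage
    1 - alpha + rho has test coverage at least 1 - alpha.
    """
    level = min(1.0, 1.0 - alpha + rho)
    n = len(scores)
    k = int(np.ceil((n + 1) * level))        # finite-sample correction
    return np.inf if k > n else np.sort(scores)[k - 1]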
- Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
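A much-simplified sketch of balance subsampling, using a full factorial on median-binarized predictors instead of the paper's fractional factorial designs:

```python
import numpy as np

def balance_subsample(X, rng=None):
    """Full-factorial balance subsampling (simplified sketch; BSSP proper
    uses fractional factorial designs to stay feasible in higher dimensions).

    Binarize each predictor at its median, then draw equally many rows from
    each occupied combination cell, which decorrelates the binarized
    predictors in the subsample. Practical only for a handful of features.
    """
    rng = np.random.default_rng(rng)
    B = (X > np.median(X, axis=0)).astype(int)   # median-binarized design
    cells = B.dot(1 << np.arange(X.shape[1]))    # cell id in the 2^d factorial
    groups = [np.where(cells == c)[0] for c in np.unique(cells)]
    m = min(len(g) for g in groups)              # equal per-cell sample size
    return np.concatenate([rng.choice(g, size=m, replace=False) for g in groups])
```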
- Stable Prediction with Model Misspecification and Agnostic Distribution Shift [41.26323389341987]
In machine learning algorithms, two main assumptions are required to guarantee performance.
One is that the test data are drawn from the same distribution as the training data, and the other is that the model is correctly specified.
Under model misspecification, distribution shift between training and test data leads to inaccuracy of parameter estimation and instability of prediction across unknown test data.
We propose a novel Decorrelated Weighting Regression (DWR) algorithm which jointly optimizes a variable decorrelation regularizer and a weighted regression model.
arXiv Detail & Related papers (2020-01-31T08:56:35Z)
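A rough sketch of the two ingredients named above: sample weights learned to shrink the off-diagonal weighted covariances among predictors, followed by weighted least squares. The paper optimizes these jointly; this sketch does so sequentially for brevity, and the `lam` regularizer keeping weights near uniform is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def dwr_sketch(X, y, lam=1e-2):
    """Decorrelated weighting + weighted regression (hedged sketch,
    not the paper's exact DWR algorithm)."""
    n, d = X.shape

    def decorrelation_loss(v):
        w = np.exp(v); w = w / w.sum()            # positive, normalized weights
        Xc = X - w @ X                            # weighted-mean-centered
        C = (Xc * w[:, None]).T @ Xc              # weighted covariance matrix
        off = C - np.diag(np.diag(C))             # off-diagonal entries only
        return np.sum(off**2) + lam * np.sum(v**2)  # keep weights near uniform

    v = minimize(decorrelation_loss, np.zeros(n), method="L-BFGS-B").x
    w = np.exp(v); w = w / w.sum()
    # Weighted least squares with the learned decorrelating weights.
    Xw = X * np.sqrt(w)[:, None]
    yw = y * np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta, w
```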