Mandoline: Model Evaluation under Distribution Shift
- URL: http://arxiv.org/abs/2107.00643v1
- Date: Thu, 1 Jul 2021 17:57:57 GMT
- Title: Mandoline: Model Evaluation under Distribution Shift
- Authors: Mayee Chen, Karan Goel, Nimit Sohoni, Fait Poms, Kayvon Fatahalian, Christopher Ré
- Abstract summary: Machine learning models are often deployed in different settings than they were trained and validated on.
We develop Mandoline, a new evaluation framework that mitigates these issues.
Users write simple "slicing functions" - noisy, potentially correlated binary functions intended to capture possible axes of distribution shift.
- Score: 8.007644303175395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning models are often deployed in different settings than they
were trained and validated on, posing a challenge to practitioners who wish to
predict how well the deployed model will perform on a target distribution. If
an unlabeled sample from the target distribution is available, along with a
labeled sample from a possibly different source distribution, standard
approaches such as importance weighting can be applied to estimate performance
on the target. However, importance weighting struggles when the source and
target distributions have non-overlapping support or are high-dimensional.
Taking inspiration from fields such as epidemiology and polling, we develop
Mandoline, a new evaluation framework that mitigates these issues. Our key
insight is that practitioners may have prior knowledge about the ways in which
the distribution shifts, which we can use to better guide the importance
weighting procedure. Specifically, users write simple "slicing functions" -
noisy, potentially correlated binary functions intended to capture possible
axes of distribution shift - to compute reweighted performance estimates. We
further describe a density ratio estimation framework for the slices and show
how its estimation error scales with slice quality and dataset size. Empirical
validation on NLP and vision tasks shows that Mandoline can estimate performance on
the target distribution up to $3\times$ more accurately compared to standard
baselines.
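As a rough illustration of the reweighting idea in the abstract, the sketch below estimates target-distribution accuracy from binary slicing-function outputs. It trains a logistic-regression classifier to discriminate source from target examples in slice space and converts its probabilities into density ratios; this classifier-based estimator is a generic stand-in for the paper's own density ratio estimation framework, and the function and argument names (`estimate_target_accuracy`, `source_slices`, etc.) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_target_accuracy(source_slices, target_slices, source_correct):
    """Slice-based reweighted accuracy estimate (illustrative sketch).

    source_slices:  (n_s, k) binary slicing-function outputs on labeled source data
    target_slices:  (n_t, k) binary slicing-function outputs on unlabeled target data
    source_correct: (n_s,) 0/1 indicator that the model was correct on each source example
    """
    source_correct = np.asarray(source_correct, dtype=float)
    n_s, n_t = len(source_slices), len(target_slices)

    # Train a classifier to distinguish source (label 0) from target (label 1)
    # using only the slice representations.
    X = np.vstack([source_slices, target_slices])
    y = np.concatenate([np.zeros(n_s), np.ones(n_t)])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Convert P(target | slices) into density ratios p_target / p_source,
    # correcting for the sample-size prior n_s / n_t.
    p = np.clip(clf.predict_proba(source_slices)[:, 1], 1e-6, 1 - 1e-6)
    ratios = (p / (1 - p)) * (n_s / n_t)

    # Self-normalized importance weighting of source correctness.
    weights = ratios / ratios.sum()
    return float(weights @ source_correct)
```

Self-normalizing the weights keeps the estimate bounded even when a few source examples receive very large ratios.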
Related papers
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- Enhancing Robustness of Foundation Model Representations under Provenance-related Distribution Shifts [8.298173603769063]
We examine the stability of models built on foundation models under distribution shift.
We focus on confounding by provenance, a form of distribution shift that emerges in the context of multi-institutional datasets.
Results indicate that while foundation models do show some out-of-the-box robustness to confounding-by-provenance related distribution shifts, this can be improved through adjustment.
arXiv Detail & Related papers (2023-12-09T02:02:45Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold (a minimal sketch appears after this list).
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Evaluating Predictive Uncertainty and Robustness to Distributional Shift Using Real World Data [0.0]
We propose metrics for general regression tasks using the Shifts Weather Prediction dataset.
We also present an evaluation of the baseline methods using these metrics.
arXiv Detail & Related papers (2021-11-08T17:32:10Z)
- Predicting with Confidence on Unseen Distributions [90.68414180153897]
We connect domain adaptation and predictive uncertainty literature to predict model accuracy on challenging unseen distributions.
We find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts.
We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference.
arXiv Detail & Related papers (2021-07-07T15:50:18Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Estimating Generalization under Distribution Shifts via Domain-Invariant Representations [75.74928159249225]
We use a set of domain-invariant predictors as a proxy for the unknown, true target labels.
The error of the resulting risk estimate depends on the target risk of the proxy model.
arXiv Detail & Related papers (2020-07-06T17:21:24Z)
- Calibrated Adversarial Refinement for Stochastic Semantic Segmentation [5.849736173068868]
We present a strategy for learning a calibrated predictive distribution over semantic maps, where the probability associated with each prediction reflects its ground truth correctness likelihood.
We demonstrate the versatility and robustness of the approach by achieving state-of-the-art results on the multigrader LIDC dataset and on a modified Cityscapes dataset with injected ambiguities.
We show that the core design can be adapted to other tasks requiring learning a calibrated predictive distribution by experimenting on a toy regression dataset.
arXiv Detail & Related papers (2020-06-23T16:39:59Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for metric-based few-shot approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
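For the Average Thresholded Confidence entry above, here is a minimal sketch of the thresholding idea, assuming a max-softmax confidence score: the threshold is chosen so that the fraction of source examples below it matches the observed source error rate, and target accuracy is then predicted as the fraction of target examples above it. The function and argument names are hypothetical, and the actual ATC method also considers other score functions and refinements.

```python
import numpy as np

def atc_accuracy_estimate(source_conf, source_correct, target_conf):
    """ATC-style target accuracy estimate (illustrative sketch).

    source_conf:    (n_s,) model confidence (e.g., max softmax) on labeled source data
    source_correct: (n_s,) 0/1 correctness of the model on the source data
    target_conf:    (n_t,) model confidence on unlabeled target data
    """
    source_conf = np.asarray(source_conf, dtype=float)
    target_conf = np.asarray(target_conf, dtype=float)

    # Source error rate from the labeled validation sample.
    source_err = 1.0 - float(np.mean(source_correct))

    # Pick the threshold so that the fraction of source examples falling
    # below it matches the source error rate.
    threshold = np.quantile(source_conf, source_err)

    # Predicted target accuracy: fraction of target examples above the threshold.
    return float(np.mean(target_conf >= threshold))
```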
This list is automatically generated from the titles and abstracts of the papers on this site.