WILDS: A Benchmark of in-the-Wild Distribution Shifts
- URL: http://arxiv.org/abs/2012.07421v2
- Date: Tue, 9 Mar 2021 07:49:42 GMT
- Title: WILDS: A Benchmark of in-the-Wild Distribution Shifts
- Authors: Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin
Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas
Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton
A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma
Pierson, Sergey Levine, Chelsea Finn, Percy Liang
- Abstract summary: Distribution shifts can substantially degrade the accuracy of machine learning systems deployed in the wild.
We present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts.
We show that standard training results in substantially lower out-of-distribution than in-distribution performance.
- Score: 157.53410583509924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distribution shifts -- where the training distribution differs from the test
distribution -- can substantially degrade the accuracy of machine learning (ML)
systems deployed in the wild. Despite their ubiquity, these real-world
distribution shifts are under-represented in the datasets widely used in the ML
community today. To address this gap, we present WILDS, a curated collection of
8 benchmark datasets that reflect a diverse range of distribution shifts which
naturally arise in real-world applications, such as shifts across hospitals for
tumor identification; across camera traps for wildlife monitoring; and across
time and location in satellite imaging and poverty mapping. On each dataset, we
show that standard training results in substantially lower out-of-distribution
than in-distribution performance, and that this gap remains even with models
trained by existing methods for handling distribution shifts. This underscores
the need for new training methods that produce models which are more robust to
the types of distribution shifts that arise in practice. To facilitate method
development, we provide an open-source package that automates dataset loading,
contains default model architectures and hyperparameters, and standardizes
evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
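For orientation, the open-source package described in the abstract can be exercised in a few lines. Below is a minimal sketch based on the `wilds` package's documented Python interface; the dataset choice, transform, and placeholder predictions are illustrative, and exact signatures may vary across versions.

```python
# pip install wilds  (also requires torch and torchvision)
import torch
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader, get_eval_loader

# Download one benchmark dataset (tumor identification across hospitals)
# and wrap its official train / test splits.
dataset = get_dataset(dataset="camelyon17", download=True)
transform = transforms.ToTensor()
train_data = dataset.get_subset("train", transform=transform)
test_data = dataset.get_subset("test", transform=transform)

train_loader = get_train_loader("standard", train_data, batch_size=32)
test_loader = get_eval_loader("standard", test_data, batch_size=32)

# Each batch carries metadata (e.g. which hospital a patch came from);
# the standardized evaluation uses it to report per-domain performance.
all_pred, all_true, all_meta = [], [], []
for x, y_true, metadata in test_loader:
    y_pred = torch.zeros_like(y_true)  # placeholder for model(x).argmax(-1)
    all_pred.append(y_pred); all_true.append(y_true); all_meta.append(metadata)

results, results_str = dataset.eval(
    torch.cat(all_pred), torch.cat(all_true), torch.cat(all_meta))
print(results_str)
```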
Related papers
- Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations [3.8980564330208662]
We propose a model for the setting in which the attribute responsible for the shift is unknown in advance.
We show that our approach improves generalization on diverse class distributions in both simulations and real-world datasets.
arXiv Detail & Related papers (2023-11-30T14:14:31Z)
- Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time [69.77704012415845]
Temporal shifts can considerably degrade the performance of machine learning models deployed in the real world.
We benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data.
arXiv Detail & Related papers (2022-11-25T17:07:53Z)
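The fixed-split evaluation behind the gap reported above reduces to comparing accuracy before and after a cutoff time (the Wild-Time paper calls its two strategies Eval-Fix and Eval-Stream). A toy sketch with made-up numbers, not the benchmark's actual interface:

```python
import numpy as np

def temporal_gap(timestamps, correct, split_time):
    """In-distribution vs. out-of-distribution accuracy for a model
    trained only on data observed before `split_time` (Eval-Fix style)."""
    in_dist = correct[timestamps < split_time].mean()
    out_dist = correct[timestamps >= split_time].mean()
    return in_dist, out_dist, in_dist - out_dist

# Illustrative numbers only; Wild-Time reports a ~20% average drop.
rng = np.random.default_rng(0)
ts = np.arange(1000)
correct = np.where(ts < 700, rng.random(1000) < 0.90, rng.random(1000) < 0.72)
print(temporal_gap(ts, correct.astype(float), split_time=700))
```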
- Robust Calibration with Multi-domain Temperature Scaling [86.07299013396059]
We develop a systematic calibration model to handle distribution shifts by leveraging data from multiple domains.
Our proposed method -- multi-domain temperature scaling -- uses the heterogeneity across domains to improve calibration robustness under distribution shift.
arXiv Detail & Related papers (2022-06-06T17:32:12Z)
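For context on the building block: standard temperature scaling fits one scalar T on held-out data, and the multi-domain variant above fits one per domain before combining them. A hedged sketch of the per-domain fit only (the dict-of-domains layout is an illustrative assumption; the paper's cross-domain combination step is not reproduced here):

```python
import torch

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Standard temperature scaling (Guo et al., 2017): find T > 0
    minimizing the NLL of softmax(logits / T) on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T to keep T positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Per-domain building block: one temperature per calibration domain.
# `domains` is assumed to be {domain_id: (logits, labels)} -- an
# illustrative data layout, not the paper's actual interface.
def fit_per_domain(domains):
    return {d: fit_temperature(lg, lb) for d, (lg, lb) in domains.items()}
```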
- Extending the WILDS Benchmark for Unsupervised Adaptation [186.90399201508953]
We present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data.
These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities.
We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods.
arXiv Detail & Related papers (2021-12-09T18:32:38Z)
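Of the method families benchmarked above, self-training is the simplest to sketch: pseudo-label unlabeled examples on which the model is confident, then retrain on them alongside the labeled data. A generic sketch (the confidence threshold and loader format are illustrative assumptions, not the benchmark's reference implementation):

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.9):
    """Core self-training step: keep (input, predicted label) pairs
    wherever the model's softmax confidence exceeds `threshold`."""
    xs, ys = [], []
    for x, *_ in unlabeled_loader:  # unlabeled batches may also carry metadata
        probs = torch.softmax(model(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        keep = conf >= threshold
        xs.append(x[keep]); ys.append(pred[keep])
    return torch.cat(xs), torch.cat(ys)  # retrain on these plus labeled data
```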
- Evaluating Predictive Uncertainty and Robustness to Distributional Shift Using Real World Data [0.0]
We propose metrics for evaluating predictive uncertainty on general regression tasks, using the Shifts Weather Prediction dataset.
We also present an evaluation of the baseline methods using these metrics.
arXiv Detail & Related papers (2021-11-08T17:32:10Z)
- Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization [89.73665256847858]
We show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts.
Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet.
We also investigate cases where the correlation is weaker, for instance, some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS.
arXiv Detail & Related papers (2021-07-09T19:48:23Z)
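The measurement behind this claim is simple: collect ID and OOD test accuracies for many models and correlate them, after the probit transform (inverse Gaussian CDF) the paper applies to both axes. A minimal sketch with placeholder accuracies:

```python
import numpy as np
from scipy.stats import norm, pearsonr

# Placeholder values: per-model accuracies on ID and OOD test sets.
id_acc = np.array([0.85, 0.90, 0.93, 0.95, 0.97])
ood_acc = np.array([0.60, 0.68, 0.74, 0.78, 0.83])

# Probit-transform both axes before measuring the linear trend,
# as the paper does when reporting its correlations.
r, _ = pearsonr(norm.ppf(id_acc), norm.ppf(ood_acc))
print(f"probit-scale correlation: r = {r:.3f}")
```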
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
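The construction is easy to sketch: group subclasses under superclasses, then send disjoint subclass subsets to the train and test distributions, so the superclass task stays fixed while its subpopulations shift. A toy sketch (the hierarchy below is illustrative; BREEDS derives its hierarchy from ImageNet's class structure):

```python
import random

def subpopulation_split(hierarchy, seed=0):
    """For each superclass, assign half of its subclasses to the source
    (train) distribution and the other half to the target (test) one."""
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in hierarchy.items():
        subs = list(subclasses)
        rng.shuffle(subs)
        mid = len(subs) // 2
        source[superclass] = subs[:mid]
        target[superclass] = subs[mid:]
    return source, target

# Illustrative hierarchy, not the actual BREEDS one.
hierarchy = {"dog": ["beagle", "husky", "collie", "pug"],
             "cat": ["tabby", "siamese", "persian", "sphynx"]}
print(subpopulation_split(hierarchy))
```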
This list is automatically generated from the titles and abstracts of the papers on this site.