Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time
- URL: http://arxiv.org/abs/2211.14238v1
- Date: Fri, 25 Nov 2022 17:07:53 GMT
- Title: Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time
- Authors: Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh,
Chelsea Finn
- Abstract summary: Temporal shifts can considerably degrade performance of machine learning models deployed in the real world.
We benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data.
- Score: 69.77704012415845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distribution shift occurs when the test distribution differs from the
training distribution, and it can considerably degrade performance of machine
learning models deployed in the real world. Temporal shifts -- distribution
shifts arising from the passage of time -- often occur gradually and have the
additional structure of timestamp metadata. By leveraging timestamp metadata,
models can potentially learn from trends in past distribution shifts and
extrapolate into the future. While recent works have studied distribution
shifts, temporal shifts remain underexplored. To address this gap, we curate
Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts
arising in a variety of real-world applications, including patient prognosis
and news classification. On these datasets, we systematically benchmark 13
prior approaches, including methods in domain generalization, continual
learning, self-supervised learning, and ensemble learning. We use two
evaluation strategies: evaluation with a fixed time split (Eval-Fix) and
evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation
strategy, aims to provide a simple evaluation protocol, while Eval-Stream is
more realistic for certain real-world applications. Under both evaluation
strategies, we observe an average performance drop of 20% from in-distribution
to out-of-distribution data. Existing methods are unable to close this gap.
Code is available at https://wild-time.github.io/.
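For concreteness, here is a minimal, hypothetical sketch of the two evaluation strategies described above. It is not the official Wild-Time package API (see https://wild-time.github.io/ for that); the array layout, helper names, and the choice of scikit-learn's LogisticRegression as the base model are illustrative assumptions only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def eval_fix(X, y, timestamps, split_time):
    # Eval-Fix: train on all data up to a fixed split time, test on everything after it.
    train = timestamps <= split_time
    test = ~train
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    return accuracy_score(y[test], model.predict(X[test]))

def eval_stream(X, y, timestamps, window):
    # Eval-Stream: sweep over the timeline, repeatedly training on all data
    # seen so far and testing on the next `window` of timestamps.
    times = np.unique(timestamps)
    scores = []
    for i in range(1, len(times) - window + 1):
        train = timestamps < times[i]
        test = np.isin(timestamps, times[i:i + window])
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        scores.append(accuracy_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))

In-distribution performance would correspond to held-out data from the training period; the roughly 20% average drop reported above is the gap between that number and the out-of-distribution accuracy these functions compute on future timestamps.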
Related papers
- Temporal Test-Time Adaptation with State-Space Models [4.248760709042802]
Adapting a model on test samples can help mitigate the drop in performance caused by distribution shift.
Most test-time adaptation methods have focused on synthetic corruption shifts.
We propose STAD, a probabilistic state-space model that adapts a deployed model to temporal distribution shifts.
arXiv Detail & Related papers (2024-07-17T11:18:49Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both approaches: motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Prompting-based Temporal Domain Generalization [10.377683220196873]
This paper presents a novel prompting-based approach to temporal domain generalization.
Our method adapts a trained model to temporal drift by learning global prompts, domain-specific prompts, and drift-aware prompts.
Experiments on classification, regression, and time series forecasting tasks demonstrate the generality of the proposed approach.
arXiv Detail & Related papers (2023-10-03T22:40:56Z)
- AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection [7.829710051617368]
We introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection.
We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years.
We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning.
arXiv Detail & Related papers (2022-06-30T17:59:22Z)
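As a rough, hypothetical illustration of the kind of year-over-year drift analysis described in the AnoShift entry above, one can average a one-dimensional Wasserstein distance over features between two years of data. This is a simplified stand-in for the paper's per-feature analysis and full Optimal Transport comparison; the function name and the synthetic data are illustrative only.

import numpy as np
from scipy.stats import wasserstein_distance

def yearly_drift(features_a, features_b):
    # Average per-feature 1-D Wasserstein distance between two years of data,
    # each given as an (n_samples, n_features) array of numeric features.
    n_features = features_a.shape[1]
    return float(np.mean([
        wasserstein_distance(features_a[:, j], features_b[:, j])
        for j in range(n_features)
    ]))

# Synthetic example: two "years" that differ by a mean and scale shift.
rng = np.random.default_rng(0)
year_a = rng.normal(0.0, 1.0, size=(1000, 8))
year_b = rng.normal(0.5, 1.2, size=(1000, 8))
print(yearly_drift(year_a, year_b))  # larger values indicate larger shift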
- Learning from Heterogeneous Data Based on Social Interactions over Graphs [58.34060409467834]
This work proposes a decentralized architecture, where individual agents aim at solving a classification problem while observing streaming features of different dimensions.
We show that the proposed strategy enables the agents to learn consistently under this highly heterogeneous setting.
arXiv Detail & Related papers (2021-12-17T12:47:18Z)
- Extending the WILDS Benchmark for Unsupervised Adaptation [186.90399201508953]
We present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data.
These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities.
We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods.
arXiv Detail & Related papers (2021-12-09T18:32:38Z)
- Evaluating Predictive Uncertainty and Robustness to Distributional Shift Using Real World Data [0.0]
We propose metrics for general regression tasks using the Shifts Weather Prediction dataset.
We also present an evaluation of the baseline methods using these metrics.
arXiv Detail & Related papers (2021-11-08T17:32:10Z)
- WILDS: A Benchmark of in-the-Wild Distribution Shifts [157.53410583509924]
Distribution shifts can substantially degrade the accuracy of machine learning systems deployed in the wild.
We present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts.
We show that standard training results in substantially lower out-of-distribution performance than in-distribution performance.
arXiv Detail & Related papers (2020-12-14T11:14:56Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
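The subpopulation-shift construction described in the BREEDS entry above can be sketched in a few lines: given a superclass-to-subclass mapping, assign disjoint subsets of subclasses to the source (training) and target (test) distributions, so that both share the same superclasses but contain different subpopulations. The toy hierarchy and split rule below are hypothetical placeholders, not the paper's actual ImageNet class structure.

import random

def subpopulation_split(hierarchy, seed=0):
    # Split each superclass's subclasses into disjoint source/target halves.
    # `hierarchy` maps superclass name -> list of subclass names.
    rng = random.Random(seed)
    source, target = {}, {}
    for superclass, subclasses in hierarchy.items():
        shuffled = list(subclasses)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        source[superclass] = shuffled[:half]
        target[superclass] = shuffled[half:]
    return source, target

# Hypothetical toy hierarchy for illustration.
toy_hierarchy = {
    "dog": ["beagle", "husky", "poodle", "terrier"],
    "vehicle": ["sedan", "truck", "bus", "motorcycle"],
}
source_classes, target_classes = subpopulation_split(toy_hierarchy)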
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.