BEDS-Bench: Behavior of EHR-models under Distributional Shift -- A Benchmark
- URL: http://arxiv.org/abs/2107.08189v1
- Date: Sat, 17 Jul 2021 05:53:24 GMT
- Title: BEDS-Bench: Behavior of EHR-models under Distributional Shift -- A Benchmark
- Authors: Anand Avati, Martin Seneviratne, Emily Xue, Zhen Xu, Balaji
Lakshminarayanan and Andrew M. Dai
- Abstract summary: We release BEDS-Bench, a benchmark for quantifying the behavior of ML models over EHR data under OOD settings.
We evaluate several learning algorithms under BEDS-Bench and find that all of them show poor generalization performance under distributional shift in general.
- Score: 21.040754460129854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning has recently demonstrated impressive progress in predictive
accuracy across a wide array of tasks. Most ML approaches focus on
generalization performance on unseen data that are similar to the training data
(In-Distribution, or IND). However, real world applications and deployments of
ML rarely enjoy the comfort of encountering examples that are always IND. In
such situations, most ML models commonly display erratic behavior on
Out-of-Distribution (OOD) examples, such as assigning high confidence to wrong
predictions, or vice-versa. Implications of such unusual model behavior are
further exacerbated in the healthcare setting, where patient health can
potentially be put at risk. It is crucial to study the behavior and robustness
properties of models under distributional shift, understand common failure
modes, and take mitigation steps before the model is deployed. Having a
benchmark that shines light upon these aspects of a model is a first and
necessary step in addressing the issue. Recent work and interest in increasing
model robustness in OOD settings have focused more on the image modality, while the
Electronic Health Record (EHR) modality is still largely under-explored. We aim
to bridge this gap by releasing BEDS-Bench, a benchmark for quantifying the
behavior of ML models over EHR data under OOD settings. We use two open access,
de-identified EHR datasets to construct several OOD data settings to run tests
on, and measure relevant metrics that characterize crucial aspects of a model's
OOD behavior. We evaluate several learning algorithms under BEDS-Bench and find
that all of them show poor generalization performance under distributional
shift in general. Our results highlight the need and the potential to improve
robustness of EHR models under distributional shift, and BEDS-Bench provides
one way to measure progress towards that goal.
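To make the evaluation protocol concrete, the following is a minimal sketch of an IND-vs-OOD comparison in the spirit of BEDS-Bench. Synthetic arrays stand in for the EHR subgroup feature matrices, a generic scikit-learn classifier stands in for the learning algorithm, and the `synthetic_cohort` and `expected_calibration_error` helpers are illustrative assumptions; the actual datasets, subgroup splits, tasks, and metric suite are those defined in the paper.

```python
# Minimal sketch of an IND-vs-OOD evaluation in the spirit of BEDS-Bench.
# Synthetic cohorts stand in for feature matrices extracted from two EHR
# subgroups (e.g. different sites or demographic slices); the real benchmark
# builds these splits from open-access, de-identified EHR datasets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def synthetic_cohort(n, shift=0.0):
    """Placeholder for the features/labels of one patient subgroup."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 20))
    logits = X[:, 0] - 0.5 * X[:, 1] + shift
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Standard binned ECE: |accuracy - confidence| weighted over bins."""
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_hat[mask].mean())
    return ece

# "In-distribution" subgroup: train and test come from the same cohort.
X_train, y_train = synthetic_cohort(5000)
X_ind, y_ind = synthetic_cohort(2000)
# "Out-of-distribution" subgroup: a shifted cohort the model never saw.
X_ood, y_ood = synthetic_cohort(2000, shift=1.5)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, X, y in [("IND", X_ind, y_ind), ("OOD", X_ood, y_ood)]:
    p = model.predict_proba(X)[:, 1]
    print(f"{name}: AUROC={roc_auc_score(y, p):.3f}  "
          f"ECE={expected_calibration_error(y, p):.3f}")
```

A benchmark run of this kind typically reports the gap between the IND and OOD rows: a model can retain high discrimination in-distribution while its ranking and calibration degrade sharply under shift.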
Related papers
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z) - Model Reprogramming Outperforms Fine-tuning on Out-of-distribution Data in Text-Image Encoders [56.47577824219207]
In this paper, we unveil the hidden costs associated with intrusive fine-tuning techniques.
We introduce a new model reprogramming approach for fine-tuning, which we name Reprogrammer.
Our empirical evidence reveals that Reprogrammer is less intrusive and yields superior downstream models.
arXiv Detail & Related papers (2024-03-16T04:19:48Z) - Think Twice: Measuring the Efficiency of Eliminating Prediction
Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the degree of a model's reliance on any identified spurious feature.
We assess robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z) - Guide the Learner: Controlling Product of Experts Debiasing Method Based
on Token Attribution Similarities [17.082695183953486]
A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model.
Here, the underlying assumption is that the biased model resorts to shortcut features.
We introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts loss function (a generic sketch of this loss appears after this list).
arXiv Detail & Related papers (2023-02-06T15:21:41Z) - Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation).
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z) - SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in
Fine-tuned Source Code Models [58.78043959556283]
We study the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods.
Our analysis uncovers that LoRA fine-tuning consistently exhibits significantly better OOD generalization performance than full fine-tuning across various scenarios.
arXiv Detail & Related papers (2022-10-10T16:07:24Z) - How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder (AE) based models.
We develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation.
arXiv Detail & Related papers (2022-06-17T16:18:28Z) - Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual
Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.