Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of
Flaws and Benefits when Applying Over-sampling
- URL: http://arxiv.org/abs/2001.06296v2
- Date: Sat, 28 Nov 2020 16:41:03 GMT
- Authors: Gilles Vandewiele, Isabelle Dehaene, György Kovács, Lucas Sterckx,
Olivier Janssens, Femke Ongenae, Femke De Backere, Filip De Turck, Kristien
Roelens, Johan Decruyenaere, Sofie Van Hoecke, Thomas Demeester
- Abstract summary: We focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets.
We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified.
- Score: 13.463035357173045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information extracted from electrohysterography recordings could
potentially prove to be a valuable additional source of information for
estimating the risk of preterm birth. Recently, a large number of studies have
reported near-perfect results in distinguishing between recordings of patients
who will deliver at term or preterm, using a public resource called the
Term/Preterm Electrohysterogram database. However, we argue that these results
are overly optimistic due to a methodological flaw. In this work, we focus on
one specific type of methodological flaw: applying over-sampling before
partitioning the data into mutually exclusive training and testing sets. We
show how this causes the results to be biased using two artificial datasets and
reproduce results of studies in which this flaw was identified. Moreover, we
evaluate the actual impact of over-sampling on predictive performance, when
applied prior to data partitioning, using the same methodologies of related
studies, to provide a realistic view of these methodologies' generalization
capabilities. We make our research reproducible by providing all the code under
an open license.
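The flaw described above can be illustrated with a minimal, self-contained sketch (not the paper's own code; the dataset and classifier choices here are illustrative assumptions). With pure-noise features, an honest evaluation should score near chance, but if random over-sampling (duplicating minority samples) happens before the train/test split, copies of the same sample land on both sides of the partition, and the test score becomes optimistically inflated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Pure-noise features: there is no real signal, so an honest AUC is ~0.5.
X = rng.normal(size=(1000, 20))
y = np.concatenate([np.zeros(900), np.ones(100)]).astype(int)

def random_oversample(X, y, rng):
    # Duplicate randomly chosen minority-class rows until classes balance.
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# FLAWED order: over-sample first, then split. Duplicated minority rows
# leak into the test set, so the model is partly tested on its own
# training samples.
Xo, yo = random_oversample(X, y, rng)
Xtr, Xte, ytr, yte = train_test_split(
    Xo, yo, test_size=0.3, random_state=0, stratify=yo)
clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
auc_flawed = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

# CORRECT order: split first, over-sample the training partition only.
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
Xtr_o, ytr_o = random_oversample(Xtr, ytr, rng)
clf = RandomForestClassifier(random_state=0).fit(Xtr_o, ytr_o)
auc_correct = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

print(f"AUC with leakage (flawed order):  {auc_flawed:.2f}")   # inflated
print(f"AUC without leakage (correct order): {auc_correct:.2f}")  # near chance
```

The same pitfall applies to synthetic over-samplers such as SMOTE: any re-sampling step must be fit on the training partition only, for instance inside a cross-validation pipeline, so that the test set remains untouched.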
Related papers
- A step towards the integration of machine learning and small area
estimation [0.0]
We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristic.
We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well.
Moreover, we propose a method for estimating the accuracy of machine learning predictors, making it possible to compare their accuracy with that of classic methods.
arXiv Detail & Related papers (2024-02-12T09:43:17Z) - Approximating Counterfactual Bounds while Fusing Observational, Biased
and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z) - Towards Assessing Data Bias in Clinical Trials [0.0]
Health care datasets can still be affected by data bias.
Data bias provides a distorted view of reality, leading to incorrect analysis results and, consequently, to flawed decisions.
This paper proposes a method to address bias in datasets that: (i) defines the types of data bias that may be present in the dataset, (ii) characterizes and quantifies data bias with adequate metrics, and (iii) provides guidelines to identify, measure, and mitigate data bias for different data sources.
arXiv Detail & Related papers (2022-12-19T17:10:06Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - Evaluating Causal Inference Methods [0.4588028371034407]
We introduce a deep generative model-based framework, Credence, to validate causal inference methods.
arXiv Detail & Related papers (2022-02-09T00:21:22Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Increasing the efficiency of randomized trial estimates via linear
adjustment for a prognostic score [59.75318183140857]
Estimating causal effects from randomized experiments is central to clinical research.
Most methods for historical borrowing achieve reductions in variance by sacrificing strict type-I error rate control.
arXiv Detail & Related papers (2020-12-17T21:10:10Z) - Do We Really Sample Right In Model-Based Diagnosis? [0.0]
We study the representativeness of the produced samples in terms of their estimates of fault explanations.
We investigate the impact of sample size and the optimal trade-off between sampling efficiency and effectiveness.
arXiv Detail & Related papers (2020-09-25T12:30:14Z) - Impact of Medical Data Imprecision on Learning Results [9.379890125442333]
We study the impact of imprecision on prediction results in a healthcare application.
A pre-trained model is used to predict future state of hyperthyroidism for patients.
arXiv Detail & Related papers (2020-07-24T06:54:57Z) - Enabling Counterfactual Survival Analysis with Balanced Representations [64.17342727357618]
Survival data are frequently encountered across diverse medical applications, e.g., drug development, risk profiling, and clinical trials.
We propose a theoretically grounded unified framework for counterfactual inference applicable to survival outcomes.
arXiv Detail & Related papers (2020-06-14T01:15:00Z) - Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.