Generalizability of Machine Learning Models: Quantitative Evaluation of
Three Methodological Pitfalls
- URL: http://arxiv.org/abs/2202.01337v1
- Date: Tue, 1 Feb 2022 05:07:27 GMT
- Title: Generalizability of Machine Learning Models: Quantitative Evaluation of
Three Methodological Pitfalls
- Authors: Farhad Maleki, Katie Ovens, Rajiv Gupta, Caroline Reinhold, Alan
Spatz, Reza Forghani
- Abstract summary: We implement random forest and deep convolutional neural network models using several medical imaging datasets.
We show that violation of the independence assumption could substantially affect model generalizability.
Inappropriate performance indicators could lead to erroneous conclusions.
- Score: 1.3870303451896246
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the great potential of machine learning, the lack of generalizability
has hindered the widespread adoption of these technologies in routine clinical
practice. We investigate three methodological pitfalls: (1) violation of the
independence assumption, (2) model evaluation with an inappropriate performance
indicator, and (3) batch effect, and how these pitfalls could affect the
generalizability of machine learning models. We implement random forest and
deep convolutional neural network models using several medical imaging
datasets, including head and neck CT, lung CT, chest X-Ray, and
histopathological images, to quantify and illustrate the effect of these
pitfalls. We develop these models with and without each pitfall and compare the
performance of the resulting models in terms of accuracy, precision, recall,
and F1 score. Our results showed that violation of the independence assumption
could substantially affect model generalizability. More specifically, (I)
applying oversampling before splitting data into training, validation, and test
sets; (II) performing data augmentation before splitting data; (III)
distributing data points for a subject across training, validation, and test
sets; and (IV) applying feature selection before splitting data led to
superficial boosts in model performance. We also observed that inappropriate
performance indicators could lead to erroneous conclusions, and that batch effect
could lead to models that lack generalizability. The aforementioned
methodological pitfalls lead to machine learning models with over-optimistic
performance. These errors, if made, cannot be captured using internal model
evaluation, and the inaccurate predictions made by the model may lead to wrong
conclusions and interpretations. Therefore, avoiding these pitfalls is a
necessary condition for developing generalizable models.
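To make the correct ordering of these steps concrete, the following is a minimal sketch of a leakage-free pipeline, assuming tabular image-derived features with one row per image and a subject identifier. The variable names, the random forest classifier, and the use of scikit-learn and imbalanced-learn are illustrative assumptions, not the authors' exact implementation; the same ordering rule applies to data augmentation (pitfall II), which should be applied only to the training images after the split.

```python
# Minimal sketch of a leakage-free evaluation pipeline (illustrative, not the
# authors' exact code). Assumes tabular features, one row per image, with a
# subject identifier; requires scikit-learn and imbalanced-learn.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import RandomOverSampler  # assumption: imbalanced-learn installed

# Synthetic stand-in data: 100 subjects, 3 images each, 50 features per image.
rng = np.random.default_rng(0)
n_subjects, images_per_subject, n_features = 100, 3, 50
subject_ids = np.repeat(np.arange(n_subjects), images_per_subject)
X = rng.normal(size=(n_subjects * images_per_subject, n_features))
y = np.repeat(rng.integers(0, 2, size=n_subjects), images_per_subject)

# Avoid pitfall (III): split by subject, so no subject contributes images
# to both the training and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Avoid pitfall (I): oversample only the training set, after splitting.
X_train_bal, y_train_bal = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# Avoid pitfall (IV): fit feature selection on the training set only,
# then apply the fitted selector to the test set.
selector = SelectKBest(f_classif, k=10).fit(X_train_bal, y_train_bal)
X_train_sel = selector.transform(X_train_bal)
X_test_sel = selector.transform(X_test)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train_sel, y_train_bal)
y_pred = model.predict(X_test_sel)

# Avoid pitfall (2), inappropriate performance indicators: report precision,
# recall, and F1 alongside accuracy, especially under class imbalance.
print("accuracy ", accuracy_score(y_test, y_pred))
print("precision", precision_score(y_test, y_pred))
print("recall   ", recall_score(y_test, y_pred))
print("F1       ", f1_score(y_test, y_pred))
```

Reversing any of these steps, e.g. oversampling, augmenting, or selecting features before the split, or splitting at the image rather than the subject level, lets information from the test set leak into training and inflates the reported metrics, which is exactly the over-optimism quantified in the paper.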
Related papers
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error of overparameterized models that achieve effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Evaluating the Fairness of Deep Learning Uncertainty Estimates in
Medical Image Analysis [3.5536769591744557]
Deep learning (DL) models have shown great success in many medical image analysis tasks.
However, deployment of the resulting models into real clinical contexts requires robustness and fairness across different sub-populations.
Recent studies have shown significant biases in DL models across demographic subgroups, indicating a lack of fairness in the models.
arXiv Detail & Related papers (2023-03-06T16:01:30Z) - A prediction and behavioural analysis of machine learning methods for
modelling travel mode choice [0.26249027950824505]
We conduct a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice.
Results indicate that the models with the highest disaggregate predictive performance provide poorer estimates of behavioural indicators and aggregate mode shares.
It is also observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.
arXiv Detail & Related papers (2023-01-11T11:10:32Z) - On the Generalization and Adaption Performance of Causal Models [99.64022680811281]
Differentiable causal discovery proposes to factorize the data-generating process into a set of modules.
We study the generalization and adaption performance of such modular neural causal models.
Our analysis shows that the modular neural causal models outperform other models on both zero- and few-shot adaptation in low-data regimes.
arXiv Detail & Related papers (2022-06-09T17:12:32Z) - Statistical quantification of confounding bias in predictive modelling [0.0]
I propose the partial and full confounder tests, which probe the null hypotheses of unconfounded and fully confounded models.
The tests provide a strict control for Type I errors and high statistical power, even for non-normally and non-linearly dependent predictions.
arXiv Detail & Related papers (2021-11-01T10:35:24Z) - Probabilistic Modeling for Human Mesh Recovery [73.11532990173441]
This paper focuses on the problem of 3D human reconstruction from 2D evidence.
We recast the problem as learning a mapping from the input to a distribution of plausible 3D poses.
arXiv Detail & Related papers (2021-08-26T17:55:11Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - A comprehensive study on the prediction reliability of graph neural
networks for virtual screening [0.0]
We investigate the effects of model architectures, regularization methods, and loss functions on the prediction performance and reliability of classification results.
Our result highlights that correct choice of regularization and inference methods is evidently important to achieve high success rate.
arXiv Detail & Related papers (2020-03-17T10:13:31Z)